ABSTRACT



We assume that p random variables, $y_1, \dots, y_p$, are distributed
according to some multivariate normal distribution (called the p
variate normal). Methods of predicting the value of one, say $y_p$,
given the values of the other p-1 variables are discussed. A study
is made of the problems encountered whenever one tries to reduce the
number of variables used to predict $y_p$ and at the same time minimize
loss in prediction accuracy. Modifications of the step-wise procedure
of adding predictor variables one at a time are considered in
some detail, and methods of using an automatic high speed electronic
computer to perform the numerous calculations involved are described.
A high speed computer program was written to generate samples from
any specified p variate normal.

I wish to express my sincere gratitude to Professor Jack R. 
Borsting, who in class introduced me to many of the mathematical 
concepts used in this paper, and as faculty adviser provided the 
guidance necessary to apply these concepts; and to Mrs. Bette Joe, 
for her most capable typing of this paper. 



TABLE OF CONTENTS

Chapter   Title

1.   Introduction
2.   The P Variate Normal Distribution
3.   Statistical Analysis of the P Variate Normal
4.   An Example
5.   Reduction in the Number of Variables in Regression -
     Introduction
6.   The Step-Wise Procedure
7.   Automatic Regression Analysis - Criteria for Halting
     Step-Wise Regression
8.   The MV REGRESSION and MV SIM Computer Programs
9.   Current Studies and Proposals for Future Research
     Bibliography

Appendix

A.   Generation of the P Variate Normal by Program MV SIM
B.   Tests of Sample Mean Vector, Z, and Sample Variance-
     Covariance Matrix, S



Chapter I 
INTRODUCTION 



The multivariate normal distribution with p variables, referred
to here as "the p variate normal", has been found to be useful as a
model for a wide variety of real world phenomena. This distribution
has been studied intensely in the literature and has many "nice"
mathematical properties.

One of the p variate normal's most useful properties is the
fact that when q of the variables are fixed, the remaining p-q
variables become a p-q variate normal, which has the same variance-
covariance matrix regardless of the actual fixed values of the first
q variables. Where q equals p-1, the variable whose value is not
fixed, say $y_p$, becomes a conditional normal random variable whose
variance is less than the variance of $y_p$ when the variables
$y_1, \dots, y_{p-1}$ are not fixed.

In chapter II, methods of "predicting" $y_p$ from known fixed
values of the other p-1 variables are described, and methods of
measuring the accuracy of prediction in terms of variance of $y_p$
are given. These methods require that the p variate normal be
specified completely by a mean vector, U, and a variance-covariance
(V-C) matrix, $\Sigma$. In chapter III, methods of approximating the
work of chapter II using sample estimates of U and $\Sigma$ are described.
These ideas are illustrated by an example in chapter IV.

After mastering the technique of regressing p-1 variables to
form a prediction equation for the last one, $y_p$, we turn to the
problem of eliminating variables that may not be useful in predicting
the value of $y_p$. Variables are eliminated by removing all reference
to them before the prediction equation is computed. Reasons for
reducing the number of variables in regression are presented in chapter
V. Later in chapter V, the process of eliminating variables from
regression is illustrated by an example using a specified five variate
normal.

At present, the only known way to find the "optimum set" of
r ($r \le p-1$) variables is to compute all regressions. Obviously
this involves extremely large numbers of computations for large
p, so that methods involving fewer computations are normally used.
Generally these faster methods produce "good" combinations of variables
in regression, but often they are not the "optimum" combination
for the same number of variables in regression.

Chapters VI through IX discuss methods of searching for a 
satisfactorily small set of variables in regression that will reduce 
the conditional variance of yp to a satisfactory level. The step- 
wise procedure, described in chapter VI, provides the basic proce- 
dure under study throughout the rest of the paper. Basically, this 
procedure consists of adding variables to regression in steps. At 
each step, the variable to be added is selected because its contri- 
bution to variance reduction is greatest at this step. That this 
procedure does not always produce optimal combinations of variables 
in regression is demonstrated. 

Also in chapter VI a statistical test to be applied at each
step when a sample is being studied is described. This test provides
a criterion for halting the step-wise process which is a function
of sample size, n.

In chapter VII automatic regression analysis performed by a
high speed digital computer is discussed. Additional halting
criteria and other improvements to the step-wise procedure are
suggested. Halt criteria proposed by Miller [7] and Efroymson
are reviewed in light of automatic regression analysis requirements.
A modification to the step-wise procedure reflecting differences in
cost of observation of variables is considered.

In chapter VIII computer programs MV REGRESSION and MV SIM,
written by the author, are presented. Basically, MV SIM generates
samples of a specified size, n, from a given p variate normal, with
which MV REGRESSION performs regression analyses. MV SIM also
computes regression parameters of the given p variate normal, the
results of which may be used as a standard for comparison
with results of regression analysis of the samples.

In chapter IX current and proposed studies using these high 
speed computer programs are outlined. 

Appendix A describes the operation of program MV SIM in detail
and gives some background on the techniques used by MV SIM to generate
samples from specified p variate normals.

Appendix B describes statistical tests performed by MV SIM
on sample mean vectors, Z, and sample (V-C) matrices, S. Results of
tests performed on a number of generated samples of different sizes
of a five variate normal and an 18 variate normal are given.



Chapter II

THE P VARIATE NORMAL DISTRIBUTION



In this chapter we introduce the multivariate normal distribution
with p variables, hereinafter called the "p variate normal".
The basic theory associated with the p variate normal is given in
detail by Graybill [4] and Anderson [1]. Certain theorems and
formulas that are important for later work on regression analysis
are given here.

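A p variate normal is completely defined by any specified
px1 vector of means, U, and any pxp positive definite symmetric
variance-covariance (V-C) matrix, $\Sigma$. Let:

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}, \qquad
U = \begin{pmatrix} u_1 \\ \vdots \\ u_p \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}.$$

The joint density function of the p variate normal, Y, is given
by:

(2.1)  $$f(y_1, \dots, y_p) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(Y-U)^T \Sigma^{-1} (Y-U)\right]$$

for $-\infty < y_i < \infty$, i = 1,...,p.

The element $\sigma_{ij}$ of $\Sigma$ is the covariance between variables $y_i$
and $y_j$, and $u_i$ of U is the mean of $y_i$.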

If the px1 vector Y is partitioned into two subvectors such
that:

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$$

(vectors $Y_1$ and $Y_2$ are (p-q)x1 and qx1 respectively, $q < p$),
and if:

$$U = \begin{pmatrix} U_1 \\ U_2 \end{pmatrix} \qquad \text{and} \qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

are the corresponding partitions of U and $\Sigma$, then it can be shown,
[4] section 3.6, that the conditional distribution of the qx1 vector
$Y_2$ given the vector $Y_1 = Y_1^*$ (a constant vector), $Y_2|Y_1^*$, is the
multivariate normal distribution with qx1 mean vector

$$U_2 + \Sigma_{21}\Sigma_{11}^{-1}(Y_1^* - U_1)$$

and qxq V-C matrix

$$\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$$

From the latter matrix we see the important fact that the
covariance matrix of the conditional random vector $Y_2|Y_1^*$ does not
depend upon the value of $Y_1^*$.

We shall represent the qxq V-C matrix of $Y_2|Y_1^*$ as:

(2.2)  $$\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$$

In particular, each element $\sigma_{ii.1,\dots,p-q}$ of this matrix
(i = p-q+1,...,p) is the conditional variance of variable $y_i$ in $Y_2$,
i.e. the variance of $y_i$ when the p-q variables in $Y_1$ are fixed.
The element $\sigma_{ii}$ in the specified V-C matrix, $\Sigma$, is the variance
of $y_i$ in the original p variate normal distribution. That $\sigma_{ii}$ is
greater than or equal to $\sigma_{ii.1,\dots,p-q}$ follows from formula 2.2
above, and the fact that $\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ is positive
semidefinite. In fact, the following relationship holds
(where $0 \le R_i \le 1$):

$$\sigma_{ii.1,\dots,p-q} = (1 - R_i^2)\,\sigma_{ii}$$

In this formula $R_i$ is the multiple correlation coefficient between
variable $y_i$ and vector $Y_1$; see [4] section 3.6.

In this paper we will consider only the case where q = 1.
Now $Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$, where Y is still px1, $Y_1$ is (p-1)x1, and $Y_2$ is the
variable $y_p$. Similarly, we partition $Y^*$, U, and $\Sigma$ so that
elements $Y_2$, $U_2$, and $\Sigma_{22}$ become $y_p$, $u_p$, and $\sigma_{pp}$ respectively.

It follows from earlier discussions that the distribution of
$y_p|Y_1^*$ is the univariate normal distribution with (scalar) mean:

(2.3)  $$u_{p.1,\dots,p-1} = U_2 + \Sigma_{21}\Sigma_{11}^{-1}(Y_1^* - U_1) = u_p + \Sigma_{21}\Sigma_{11}^{-1}(Y_1^* - U_1),$$

and scalar variance:

(2.4)  $$\sigma_{pp.1,\dots,p-1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = \sigma_{pp} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$$

Let $\beta$ be the (p-1)x1 vector $\beta = (\Sigma_{21}\Sigma_{11}^{-1})^T$. From
formula 2.3, we can write:

(2.5)  $$\begin{aligned}
y_p|Y_1^* &= u_p + \beta^T (Y_1^* - U_1) + e \\
          &= u_p + \sum_{i=1}^{p-1} \beta_i (y_i^* - u_i) + e \\
          &= u_p - \sum_{i=1}^{p-1} \beta_i u_i + \sum_{i=1}^{p-1} \beta_i y_i^* + e,
\end{aligned}$$

where $U_1$ is still $(u_1, \dots, u_{p-1})^T$, $Y_1^* = (y_1^*, \dots, y_{p-1}^*)^T$,
and e is a normally distributed random variable with mean zero. The
variance of e is $\sigma_{pp.1,\dots,p-1}$, the value of which is independent
of the actual values of $y_1^*, \dots, y_{p-1}^*$.



We define formula 2.5 as the prediction equation for $y_p$ associated
with the p variate normal. Most often we shall use it in the form:

(2.6)  $$y_p|Y_1^* - e = E(y_p|Y_1^*) = u_p + \sum_{i=1}^{p-1} \beta_i (y_i^* - u_i).$$

Now, if we know the fixed values of $Y_1^*$ (in addition to U and $\Sigma$)
we can use 2.6 to compute the mean of the conditional random
variable $y_p|Y_1^*$. A measure of the "error" involved in using the
results of 2.6 to "predict" the value of $y_p$, when $Y_1^*$ is known, is
given by $\sigma_{pp.1,\dots,p-1}$. By comparison, if the value of $Y_1^*$ is not
known, one might use the original mean of $y_p$, $u_p$, to "predict" the
value of $y_p$. The corresponding "error" of this prediction is given
by $\sigma_{pp}$, which is greater than $\sigma_{pp.1,\dots,p-1}$. The values of the
scalars, $\beta_i$, in vector $\beta$ are called partial regression coefficients.
Suppose the computed values of some of the partial regression
coefficients $\beta_j$, $\beta_k$, etc. are zero, or close to zero. Then,
obviously, insofar as estimating $y_p$ is concerned, one can save the
effort and cost of observing the values of $y_j$, $y_k$.
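
As a hedged illustration of formulas 2.3 through 2.6, consider the
smallest case, p = 2, in which every partitioned block is a scalar.
The symbol $\rho$ (the ordinary correlation between $y_1$ and $y_2$) is
notation introduced here for the sketch; it is not used elsewhere in
this paper.

```latex
\begin{align*}
\beta_1 &= \Sigma_{21}\Sigma_{11}^{-1} = \frac{\sigma_{12}}{\sigma_{11}}, \\
E(y_2 \mid y_1^*) &= u_2 + \frac{\sigma_{12}}{\sigma_{11}}\,(y_1^* - u_1), \\
\sigma_{22.1} &= \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}}
              = (1 - \rho^2)\,\sigma_{22},
\qquad \rho = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\,\sigma_{22}}}.
\end{align*}
```

In this special case the multiple correlation coefficient reduces to
$|\rho|$, and a partial regression coefficient of zero corresponds to
$\sigma_{12} = 0$, for which observing $y_1$ gives no reduction in the
variance of $y_2$.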

It often happens, especially when the number of variables, 
p, is large, that some of the variables, themselves, can be pre- 
dicted rather accurately by a linear combination of other variables. 
This shows that even if none of the partial regression coefficients 
are close to zero, it may be possible to observe only a select few 
of the variables and still predict yp nearly as accurately as when 
all of the variables are used. 

Of course, the values of the partial regression coefficients 
to be used with each variable depend upon which other variables 
are used in combination to predict yp. Throughout this paper, any 
combination of the original p-l variables that are used to predict 
yp in the manner just described will be said to be "in regression". 
The variables whose values are not to be used to predict yp we shall 
say are "not in regression". 

Once a combination of variables to be in regression has been
chosen, a modified mean vector $U'$ and V-C matrix $\Sigma'$ are formed from
the original U and $\Sigma$ respectively by removing the $u_j$ from U and
$\sigma_{ij}$ and $\sigma_{jk}$ (for all i and k) from $\Sigma$ for each variable $y_j$ that
is to be "not in regression". (If q of the p-1 variables
$y_1, \dots, y_{p-1}$ are to be "not in regression", then $U'$ is (p-q)x1 and
$\Sigma'$ is (p-q)x(p-q).) Thus, we see that all reference to those
variables not in regression is completely removed and a new p-q
variate normal is defined by $U'$ and $\Sigma'$, from which new prediction
equations (2.5 or 2.6) can be computed. Note that it is possible
to make up $\sum_{r=1}^{p-1} \binom{p-1}{r}$ prediction equations for predicting variable
$y_p$, one for each possible combination of variables $y_1, \dots, y_{p-1}$.

In chapters which follow we will discuss methods of estimating
the partial regression coefficients, $\beta_i$, and $\sigma_{pp.1,\dots,q}$, etc.,
when the values of U and $\Sigma$ are not known. Methods of choosing
which variables are "best" to use in regression will be discussed.
We shall also consider the problem of specifying relative "cost"
of observation per unit reduction in $\sigma_{pp.1,\dots,q}$.



Chapter III

STATISTICAL ANALYSIS OF THE P VARIATE NORMAL



In this chapter we assume that Y has a p-variate normal distribution
with unknown mean vector U and V-C matrix $\Sigma$. We are now
concerned with methods by which an experimenter can estimate U and
$\Sigma$, and subsequently, other parameters, such as regression
coefficients for prediction equations for predicting $y_p$; and $\sigma_{pp.1,\dots,q}$,
the conditional variance of $y_p|Y_1^*$ when variables $y_1, \dots, y_q$ are in
regression. In order to distinguish estimates of parameters from
their associated theoretical values, it is convenient to develop new
notation to be used throughout this paper, listed here for easy
reference:



                               TABLE I

Notation for           Meaning of Parameter                Notation for
Theoretical Values                                         Associated Estimated
                                                           Parameters

U                      px1 mean vector of the p            Z
                       variate normal

$\Sigma$               pxp V-C matrix of the p             S
                       variate normal

$\beta$ (beta)         qx1 vector of regression            B
                       coefficients associated with
                       q variables in regression

$\sigma_{pp}$          The element in row p, column        $s_{pp}$
                       p of $\Sigma$, which is the
                       (unconditional) variance of $y_p$.
                       $y_p$ is arbitrarily chosen to be
                       the variable to be predicted.

$\sigma_{pp.1,\dots,q}$   The conditional variance of      $s_{pp.1,\dots,q}$
                       $y_p|Y_1^*$, where $Y_1^*$ is a qx1
                       vector of fixed values of
                       $y_1, \dots, y_q$


A sample of size n can be arranged into nxp matrix form
as follows:

$$\begin{pmatrix}
y_{11} & y_{12} & \cdots & y_{1p} \\
y_{21} & y_{22} & \cdots & y_{2p} \\
\vdots & \vdots &        & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{np}
\end{pmatrix}$$

where $y_{ji}$ represents the j th observation of variable $y_i$. Note
that for this sample, observations of $y_p$, the variable later to
be predicted, are also required. Sample means are computed as:

$$\bar{y}_i = \frac{1}{n}\sum_{j=1}^{n} y_{ji}, \quad \text{for } i = 1, 2, \dots, p.$$

Sample covariances as:

$$s_{ik} = \frac{\sum_{j=1}^{n} (y_{ji} - \bar{y}_i)(y_{jk} - \bar{y}_k)}{n - 1}, \quad \text{for } i, k = 1, \dots, p.$$

For i = k the sample covariances become the sample variances:

$$s_{ii} = \frac{\sum_{j=1}^{n} (y_{ji} - \bar{y}_i)^2}{n - 1}.$$

By analogy to the mean vector U and V-C matrix, $\Sigma$, we form
the px1 sample mean vector, Z, and sample V-C matrix, S, as follows:

$$Z = \begin{pmatrix} z_1 \\ \vdots \\ z_p \end{pmatrix} = \begin{pmatrix} \bar{y}_1 \\ \vdots \\ \bar{y}_p \end{pmatrix}, \qquad
S = \begin{pmatrix} s_{11} & \cdots & s_{1p} \\ \vdots & & \vdots \\ s_{p1} & \cdots & s_{pp} \end{pmatrix}.$$

It can be verified easily that Z and S are unbiased estimates
of U and $\Sigma$ respectively; and that Z and $[(n-1)/n] \cdot S$ are maximum
likelihood estimates of U and $\Sigma$.

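To develop estimates of the parameters of the conditional
distribution of $y_p|Y_1^*$ we recall that the random variable $y_p|Y_1^*$ is
normally distributed with mean and variance given by equations 2.3
and 2.4. We partition Y, Y*, Z, S, as we did Y, Y*, U, and $\Sigma$,
respectively in Chapter II:

$$Y = \begin{pmatrix} Y_1 \\ y_p \end{pmatrix}, \quad
Y^* = \begin{pmatrix} Y_1^* \\ y_p \end{pmatrix}, \quad
Z = \begin{pmatrix} Z_1 \\ z_p \end{pmatrix}, \quad
S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & s_{pp} \end{pmatrix},$$

where, as before:

$$Y_1^* = \begin{pmatrix} y_1^* \\ \vdots \\ y_{p-1}^* \end{pmatrix} \text{ (constant vector)},$$

and

$$S_{11} = \begin{pmatrix} s_{11} & \cdots & s_{1\,p-1} \\ \vdots & & \vdots \\ s_{p-1\,1} & \cdots & s_{p-1\,p-1} \end{pmatrix}, \qquad
S_{12} = (S_{21})^T = \begin{pmatrix} s_{1p} \\ \vdots \\ s_{p-1\,p} \end{pmatrix}.$$

Since $z_i$ and $\left(\frac{n-1}{n}\right) s_{ij}$ are maximum likelihood estimates of $u_i$
and $\sigma_{ij}$ respectively, for i, j = 1,...,p, it follows from the
invariant property of maximum likelihood estimates that:

$$z_p + S_{21}S_{11}^{-1}(Y_1^* - Z_1) \quad \text{and} \quad \left[s_{pp} - S_{21}S_{11}^{-1}S_{12}\right]\frac{n-1}{n}$$

are maximum likelihood estimates of $u_p + \Sigma_{21}\Sigma_{11}^{-1}(Y_1^* - U_1)$, and
$\sigma_{pp} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ respectively.

Let:

(3.1)  $$z_{p.1,\dots,p-1} = z_p + S_{21}S_{11}^{-1}(Y_1^* - Z_1)$$

and,

(3.2)  $$s_{pp.1,\dots,p-1} = s_{pp} - S_{21}S_{11}^{-1}S_{12}.$$

It can be shown that $z_{p.1,\dots,p-1}$ and $s_{pp.1,\dots,p-1}$ are
unbiased estimates of $u_{p.1,\dots,p-1}$ and $\sigma_{pp.1,\dots,p-1}$, the mean
and variance of the conditional random variable $y_p|Y_1^*$ respectively.

Similarly, if we let $B = (b_1, \dots, b_{p-1})^T = (S_{21}S_{11}^{-1})^T$, B is a maximum
likelihood estimator of $\beta = (\Sigma_{21}\Sigma_{11}^{-1})^T$; that is, $b_i$ is a M.L.E. of
$\beta_i$ for i = 1,...,p-1. It can be shown that B is also an unbiased
estimator of $\beta$.

We can write B in the form (since $S_{11}$ is positive definite):

(3.3)  $$B = S_{11}^{-1}S_{12}; \qquad
\begin{pmatrix} b_1 \\ \vdots \\ b_{p-1} \end{pmatrix} =
\begin{pmatrix} s_{11} & \cdots & s_{1\,p-1} \\ \vdots & & \vdots \\ s_{p-1\,1} & \cdots & s_{p-1\,p-1} \end{pmatrix}^{-1}
\begin{pmatrix} s_{1p} \\ \vdots \\ s_{p-1\,p} \end{pmatrix}.$$

Equations 3.3 are called the normal equations.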
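
The estimates Z, S, and B described above can be sketched in a few
lines of code. The following is only an illustration under stated
assumptions: it is not the MV REGRESSION program, the function names
(`solve`, `estimate`, `predict`) are hypothetical, and the five toy
observations are invented so that the last variable is an exact linear
function of the first two. The last column of each row is taken to be
$y_p$.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def estimate(rows):
    """Return (Z, S, B, s_cond) for an n x p sample; the last column is y_p."""
    n, p = len(rows), len(rows[0])
    z = [sum(r[i] for r in rows) / n for i in range(p)]           # sample means
    s = [[sum((r[i] - z[i]) * (r[k] - z[k]) for r in rows) / (n - 1)
          for k in range(p)] for i in range(p)]                   # sample V-C matrix
    q = p - 1
    s11 = [row[:q] for row in s[:q]]
    s12 = [row[q] for row in s[:q]]
    b = solve(s11, s12)                                           # normal equations (3.3)
    s_cond = s[q][q] - sum(bi * si for bi, si in zip(b, s12))     # formula (3.2)
    return z, s, b, s_cond


def predict(z, b, y_star):
    """Estimated conditional mean z_p + sum b_i (y_i* - z_i)."""
    return z[-1] + sum(bi * (yi - zi) for bi, yi, zi in zip(b, y_star, z))


# Invented toy data: third column equals 2*y1 + 3*y2 + 1 exactly.
ROWS = [(1, 2, 9), (2, 1, 8), (3, 5, 22), (4, 3, 18), (5, 6, 29)]
Z, S, B, S_COND = estimate(ROWS)
print("B =", B, " conditional variance estimate =", S_COND)
print("prediction at (2, 2):", predict(Z, B, [2, 2]))
```

Because the toy relation is exact, B recovers the coefficients 2 and 3
and the estimated conditional variance is zero up to rounding; with
real data neither would hold exactly.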

Substituting $z_{p.1,\dots,p-1}$ for $u_{p.1,\dots,p-1}$, we obtain an
unbiased estimate for the value of the prediction equation, 2.6, by:

(3.4)  $$\hat{E}(y_p|Y_1^*) = z_{p.1,\dots,p-1} = z_p + S_{21}S_{11}^{-1}(Y_1^* - Z_1) = z_p + \sum_{i=1}^{p-1} b_i (y_i^* - z_i).$$



Chapter IV
AN EXAMPLE



Assume that an experimenter wishes to gather data from some
process involving five variables, which he assumes to be related
according to a five variate normal distribution. Suppose this five
variate normal actually is defined (completely) by the following
theoretical vector, U, and V-C matrix, $\Sigma$ (see note 1):

(4.1)  $$U = \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{pmatrix} =
\begin{pmatrix} 7.4600 \\ 48.1500 \\ 11.7700 \\ 30.0000 \\ 95.4200 \end{pmatrix}$$

(4.2)  $$\Sigma = \begin{pmatrix}
 34.6025 &   20.9233 & -31.0517 &  -24.1667 &   64.6633 \\
 20.9233 &  242.1408 & -13.8783 & -253.4167 &  191.0792 \\
-31.0517 &  -13.8783 &  41.0258 &    3.1667 &  -51.5192 \\
-24.1667 & -253.4167 &   3.1667 &  280.1667 & -206.8083 \\
 64.6633 &  191.0792 & -51.5192 & -206.8083 &  226.3133
\end{pmatrix}$$

Using developments of chapter II, we let $Y = (y_1, \dots, y_5)^T$.

1. The values of U and $\Sigma$ used here were computed as sample vector Z
and V-C matrix S using data from table 20.4, page 647 of Hald [?].
The results of tables 20.5 and 20.6 of Hald were used to verify the
results of computer program MVSIM, which performed most of the
computations required for this paper.



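We know that we are going to be given values of $y_1, y_2, y_3, y_4$,
from which we will predict $y_5$. Hence, we must set up the prediction
equation for $y_5$ (equation 2.6). Accordingly, we partition U and $\Sigma$
as:

$$U = \begin{pmatrix} U_1 \\ U_2 \end{pmatrix} =
\left(\begin{array}{r} 7.4600 \\ 48.1500 \\ 11.7700 \\ 30.0000 \\ \hline 95.4200 \end{array}\right)$$

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} =
\left(\begin{array}{rrrr|r}
 34.6025 &   20.9233 & -31.0517 &  -24.1667 &   64.6633 \\
 20.9233 &  242.1408 & -13.8783 & -253.4167 &  191.0792 \\
-31.0517 &  -13.8783 &  41.0258 &    3.1667 &  -51.5192 \\
-24.1667 & -253.4167 &   3.1667 &  280.1667 & -206.8083 \\ \hline
 64.6633 &  191.0792 & -51.5192 & -206.8083 &  226.3133
\end{array}\right)$$

For $\beta$, we get

$$\beta = (\Sigma_{21}\Sigma_{11}^{-1})^T =
\begin{pmatrix} 1.5513 \\ .5103 \\ .1021 \\ -.1438 \end{pmatrix} =
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix}$$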



The prediction equation for $y_5$ becomes:

(4.3)  $$\begin{aligned}
E(y_5|Y_1^*) &= u_5 - \sum_{i=1}^{4} \beta_i u_i + \sum_{i=1}^{4} \beta_i y_i^* \\
 &= 95.4200 - (1.5513 \;\; .5103 \;\; .1021 \;\; -.1438)
    \begin{pmatrix} 7.4600 \\ 48.1500 \\ 11.7700 \\ 30.0000 \end{pmatrix}
    + (1.5513 \;\; .5103 \;\; .1021 \;\; -.1438)
    \begin{pmatrix} y_1^* \\ y_2^* \\ y_3^* \\ y_4^* \end{pmatrix}
\end{aligned}$$

or,

(4.4)  $$E(y_5|Y_1^*) = 62.3881 + 1.5513\,y_1^* + .5103\,y_2^* + .1021\,y_3^* - .1438\,y_4^*$$

The variance of $y_5|Y_1^*$, $\sigma_{55.1,2,3,4}$ (the conditional variance
of $y_5$ given $y_1, y_2, y_3, y_4$), is:

$$\sigma_{55.1,2,3,4} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = 226.3133 - 222.3242 = 3.9891$$

which is a measure of the prediction error when formula 4.4 is used
to predict $y_5$ when $Y_1^*$ is known. By comparison, if the values of $Y_1^*$
were ignored and if, instead, the value $u_5 = E(y_5) = 95.4200$ was
always used to estimate $y_5$, the corresponding measure of the
prediction error would be $\sigma_{55} = 226.3133$.

Thus, by knowing the mean vector U, and the V-C matrix $\Sigma$,
given by formulas 4.1 and 4.2, we can set up the above prediction
equation, 4.4. Then for any set of values $y_1, y_2, y_3, y_4$, we can
make an accurate prediction of $y_5$ without observing its value.
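
The arithmetic behind formulas 4.3 and 4.4 can be retraced with a
short sketch. This is an illustration only, not the MV SIM program:
the helper names (`solve`, `prediction_equation`) are hypothetical, and
the inputs are the four-decimal values of U and $\Sigma$ printed in 4.1
and 4.2, so the results should agree with the text only up to
rounding.

```python
U = [7.4600, 48.1500, 11.7700, 30.0000, 95.4200]
SIGMA = [
    [34.6025, 20.9233, -31.0517, -24.1667, 64.6633],
    [20.9233, 242.1408, -13.8783, -253.4167, 191.0792],
    [-31.0517, -13.8783, 41.0258, 3.1667, -51.5192],
    [-24.1667, -253.4167, 3.1667, 280.1667, -206.8083],
    [64.6633, 191.0792, -51.5192, -206.8083, 226.3133],
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def prediction_equation(u, sigma):
    """Return (beta, intercept, conditional variance) for predicting the last variable."""
    q = len(u) - 1
    s11 = [row[:q] for row in sigma[:q]]           # Sigma_11
    s12 = [row[q] for row in sigma[:q]]            # Sigma_12
    beta = solve(s11, s12)                         # beta = Sigma_11^{-1} Sigma_12
    intercept = u[q] - sum(b * m for b, m in zip(beta, u))
    cond_var = sigma[q][q] - sum(b * s for b, s in zip(beta, s12))
    return beta, intercept, cond_var


BETA, INTERCEPT, COND_VAR = prediction_equation(U, SIGMA)
print("beta      =", [round(b, 4) for b in BETA])
print("intercept =", round(INTERCEPT, 4))
print("sigma_55.1,2,3,4 =", round(COND_VAR, 4))
```

The printed values can be compared with the coefficients of 4.4 and
with the conditional variance 3.9891 quoted above.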

The problem facing the experimenter is more complicated than
the one discussed in the preceding paragraphs. This is because
he does not know the values of mean vector, U, and the V-C matrix,
$\Sigma$. All he knows is that (by assumption) $y_1, y_2, y_3, y_4, y_5$ are
distributed according to some five variate normal distribution, and,
therefore, are completely specified by some theoretical mean vector
U and V-C matrix $\Sigma$ whose actual values he will never know.

Assume the experimenter draws a sample of size 500 from this
five variate normal distribution (specified by equations 4.1 and
4.2). He then computes all sample means, variances, and covariances
($\bar{y}_i = z_i$, $s_{ii}$, $s_{ij}$ respectively) and forms the sample mean
vector, Z, and sample V-C matrix S as defined in chapter III.
Suppose, as an example, he obtains the following results upon
drawing a sample of size 500:

$$Z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \end{pmatrix} =
\left(\begin{array}{r} 7.7764 \\ 48.7155 \\ 11.5687 \\ 29.3039 \\ \hline 96.4816 \end{array}\right)$$

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} =
\left(\begin{array}{rrrr|r}
 34.6160 &   20.1495 & -32.8248 &  -21.5656 &   64.0139 \\
 20.1495 &  217.9056 & -18.3443 & -223.1718 &  171.9229 \\
-32.8248 &  -18.3443 &  42.8371 &    7.4936 &  -57.4059 \\
-21.5656 & -223.1718 &   7.4936 &  242.3209 & -180.4153 \\ \hline
 64.0139 &  171.9229 & -57.4059 & -180.4153 &  211.0392
\end{array}\right)$$

To estimate $\beta$ by B we compute

$$B = (S_{21}S_{11}^{-1})^T = \begin{pmatrix} 1.6360 \\ .5921 \\ .1774 \\ -.0590 \end{pmatrix}$$

Hence, the estimate of the prediction equation 3.4 becomes:

$$\begin{aligned}
\hat{E}(y_5|Y_1^*) &= z_5 - \sum_{i=1}^{4} b_i z_i + \sum_{i=1}^{4} b_i y_i^* \\
 &= 54.5921 + 1.6360\,y_1^* + .5921\,y_2^* + .1774\,y_3^* - .0590\,y_4^*
\end{aligned}$$

The unbiased estimate of the conditional variance of $y_5|Y_1^*$
would be:

$$s_{55.1,2,3,4} = \left[s_{55} - S_{21}S_{11}^{-1}S_{12}\right]\frac{n-1}{n-p} = 4.0368 \times \frac{499}{495} = 4.0694$$

Now the experimenter is in a position to predict the value of $y_5$
given a set of values $y_1^*, y_2^*, y_3^*, y_4^*$. For example, suppose he
is given that the $y_i^* = u_i$, the true means of the $y_i$. (Of course,
he doesn't know that these are true means.)

Using the true prediction equation, we get (see 4.3):

$$E(y_5|Y_1^* = U_1) = u_5 - \sum_{i=1}^{4} \beta_i u_i + \sum_{i=1}^{4} \beta_i u_i = 95.4194.$$

The experimenter would estimate this value as:

$$54.5921 + (1.6360)(7.4600) + (.5921)(48.1500) + (.1774)(11.7700) + (-.0590)(30.0000) = 95.6243.$$
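
The comparison just made is plain arithmetic and can be checked
directly. The sketch below simply evaluates the two prediction
equations, using the rounded coefficients printed in 4.4 and in the
estimated equation above; the function name `evaluate` is hypothetical.

```python
def evaluate(intercept, coeffs, y_star):
    """Evaluate a linear prediction equation at the observed values y_star."""
    return intercept + sum(b * y for b, y in zip(coeffs, y_star))


# Rounded coefficients as printed in the text.
TRUE_EQ = (62.3881, [1.5513, 0.5103, 0.1021, -0.1438])   # equation 4.4
EST_EQ = (54.5921, [1.6360, 0.5921, 0.1774, -0.0590])    # sample estimate
Y_STAR = [7.4600, 48.1500, 11.7700, 30.0000]             # the true means u_1..u_4

TRUE_VALUE = evaluate(TRUE_EQ[0], TRUE_EQ[1], Y_STAR)
EST_VALUE = evaluate(EST_EQ[0], EST_EQ[1], Y_STAR)
print(round(TRUE_VALUE, 4), round(EST_VALUE, 4), round(EST_VALUE - TRUE_VALUE, 4))
```

The true equation returns a value slightly below $u_5 = 95.4200$ only
because its printed coefficients are rounded to four decimals.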






Chapter V 



REDUCTION IN THE NUMBER OF VARIABLES IN REGRESSION - INTRODUCTION 



Experience has shown that when the number of variables, p, is
large, say over 20, usually a relatively small number of variables
can be found to use in regression to predict $y_p$ nearly as accurately
as when all p-1 variables are used in regression ([?], page 20).
Finding such a small combination of variables is desirable for a
number of reasons:

1) The prediction equation has fewer terms; thus it is
   easier to compute a predicted value of $y_p$.

2) Fewer variables need to be observed in order to make
   a prediction of $y_p$. Presumably this would result in
   reducing the cost of observing variables for each
   prediction of $y_p$.

3) When p is large, the prediction equation involving p-1
   variables requires many computations. Step-wise
   procedures, described later, when yielding a relatively
   small number of variables in regression, produce a
   prediction equation with much less effort.

4) When the regression is being performed on a sample,
   variables that do not contribute much variance
   reduction of $y_p$ can actually cause the prediction equation
   to yield a worse fit to the underlying (specified) p
   variate normal than would result if they were omitted
   from regression. The reason is that the longer equation
   can overfit the sample and ascribe some of the variation
   due to small scale random fluctuations to one of the
   predictors "by accident".

As one would suspect, whenever a single variable, $y_k$, is added
to regression the old conditional variance, say, $\sigma_{pp.1,\dots,q}$, is
always greater than or equal to the new conditional variance,
$\sigma_{pp.1,\dots,q,k}$. However, usually the amount by which $\sigma_{pp.1,\dots,q}$
is reduced becomes small as the number of variables in regression
increases, even though optimal combinations for each number of
variables in regression are used. To illustrate this idea, let us
consider an example of a regression problem under ideal conditions.
That is, we shall examine a p variate normal specified in terms of
vector U and matrix $\Sigma$.

We first compute the prediction equation for $y_p$, and the
associated conditional variance $\sigma_{pp.1,\dots,q}$, for each possible
combination of variables $y_1, \dots, y_{p-1}$ in regression
($\sum_{r=1}^{p-1} \binom{p-1}{r}$ sets of prediction equations to solve). We shall then
group the results according to number of variables in regression, and
from each group pick the "optimal" combination of variables in
regression; that is, the combination of variables, say, $y_1, \dots, y_q$, in
regression producing the smallest $\sigma_{pp.1,\dots,q}$ (see note 1).



1. Clarification of notation: The reader should understand that
whenever a "combination of variables in regression, say, $y_1, \dots, y_q$,
and the associated conditional variance, $\sigma_{pp.1,\dots,q}$" is discussed
as in the preceding paragraph, the q variables in regression are not
necessarily meant to be regarded as the first q variables as defined
by position in the original vector U and matrix $\Sigma$. In other words,
in order to ease notational difficulty, variables in regression are
temporarily relabeled $y_1, \dots, y_q$.

With this grouping, we can now start with one variable in
regression and add to the number of variables in regression one at
a time, each time choosing the "optimal" combination of variables
for that group, until we decide that adding more variables to
regression will not reduce $\sigma_{pp.1,\dots,q}$ enough to make it worthwhile.

In our example, we shall use the five variate normal as defined
by 4.1 and 4.2. Using only $y_1$ in regression, the prediction
equation becomes:

$$E(y_5|Y_1^*) = u_5 - \beta_1 u_1 + \beta_1 y_1^*$$

where $\beta_1 = \beta = [\Sigma_{21}\Sigma_{11}^{-1}]^T = \dfrac{64.6633}{34.6025} = 1.8687$,

and

$$\sigma_{55.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}
 = 226.3133 - \frac{64.6633 \times 64.6633}{34.6025} = 105.4739.$$



Similarly, we compute partial regression coefficients $\beta_i$ and $\sigma_{55.1,\dots,q}$
for all 15 possible combinations of the variables $y_1, y_2, y_3, y_4$
in regression. Table II shows the results.

                                Table II

Variables in         Partial Regression Coefficients         Associated
Regression      q    $\beta_1$   $\beta_2$   $\beta_3$   $\beta_4$      Conditional Variance
                                                              $\sigma_{55.1,\dots,q}$

  y1            1    1.8687                                   105.4739
  y2            1              .7891                           75.5280
  y3            1                       -1.2557               161.6167
* y4            1                                  -.7381      73.6553
* y1 y2         2    1.4683    .6622                            4.8261
  y1 y3         2    2.3125              .4945                102.2551
  y1 y4         2    1.4399                        -.6139       6.2303
  y2 y3         2              .7313    -1.0083                34.6208
  y2 y4         2              .3108               -.4569      72.4065
  y3 y4         2                       -1.1998    -.7245      14.6447
  y1 y2 y3      3    1.6959    .6569     .2500                  4.0096
* y1 y2 y4      3    1.4519    .4160               -.2365       3.9982
  y1 y3 y4      3    1.0518             -.4100     -.6427       4.2368
  y2 y3 y4      3             -.9234    -1.4479   -1.5570       6.1506
* y1 y2 y3 y4   4    1.5513    .5103     .1021     -.1438       3.9891

Note: Each group is identified by the value of q.

* Indicates the "optimal" combination of variables for the group
(for that number of variables in regression).

We now select the optimal combination from each group of variables
in regression as follows (we omit the partial regression coefficients):

                                Table III

q               Variables in Regression      Associated Conditional
Group Number    (Optimal Combination)        Variances of $y_5$

0               None                         $\sigma_{55}$ = 226.3133
1               y4                           $\sigma_{55.4}$ = 73.6553
2               y1, y2                       $\sigma_{55.1,2}$ = 4.8261
3               y1, y2, y4                   $\sigma_{55.1,2,4}$ = 3.9982
4               y1, y2, y3, y4               $\sigma_{55.1,2,3,4}$ = 3.9891

From table III we immediately see that most of the reduction of
the conditional variance of $y_5$ can be done by introducing only two
of the possible four variables into regression, namely $y_1$ and $y_2$.
Very little more is accomplished by using the other two variables,
given that $y_1$ and $y_2$ are going to be used in regression.
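
The exhaustive search that produces tables II and III can be sketched
as follows. This is a minimal illustration, assuming the four-decimal
V-C matrix of 4.2 as input; it is not the author's program, and the
names `solve`, `cond_var`, and `best_subsets` are hypothetical.
Variables are indexed from 0, so index 3 denotes $y_4$ and index 4
denotes $y_5$.

```python
from itertools import combinations

SIGMA = [
    [34.6025, 20.9233, -31.0517, -24.1667, 64.6633],
    [20.9233, 242.1408, -13.8783, -253.4167, 191.0792],
    [-31.0517, -13.8783, 41.0258, 3.1667, -51.5192],
    [-24.1667, -253.4167, 3.1667, 280.1667, -206.8083],
    [64.6633, 191.0792, -51.5192, -206.8083, 226.3133],
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def cond_var(sigma, idx, target):
    """Conditional variance of the target given the variables in idx (formula 2.4)."""
    s11 = [[sigma[i][j] for j in idx] for i in idx]
    s12 = [sigma[i][target] for i in idx]
    beta = solve(s11, s12)
    return sigma[target][target] - sum(b * s for b, s in zip(beta, s12))


def best_subsets(sigma, target):
    """For each group size q, return the subset with the smallest conditional variance."""
    predictors = [i for i in range(len(sigma)) if i != target]
    best = {0: ((), sigma[target][target])}
    for q in range(1, len(predictors) + 1):
        candidates = ((c, cond_var(sigma, c, target))
                      for c in combinations(predictors, q))
        best[q] = min(candidates, key=lambda pair: pair[1])
    return best


BEST = best_subsets(SIGMA, target=4)
for q, (subset, variance) in BEST.items():
    names = ", ".join("y%d" % (i + 1) for i in subset) or "None"
    print(q, names, round(variance, 4))
```

Every subset is examined, so the amount of work grows with the number
of combinations discussed in the next paragraph.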

Note that the five variate normal is easily handled by an
electronic computer because only $\sum_{j=1}^{4} \binom{4}{j} = 15$ prediction equations had
to be computed, none with more than five variables involved. On
the other hand, the 18 variate normal, for example, requires
$\sum_{j=1}^{17} \binom{17}{j} = 131,071$ prediction equations, most of which involve many
variables. Hence this procedure is not always feasible even when
today's high speed electronic computers are available.
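
The two counts quoted above follow from the binomial theorem. The
short derivation below is added as a check; the closed form
$2^{p-1} - 1$ is not stated in the original text.

```latex
\begin{align*}
\sum_{j=1}^{p-1} \binom{p-1}{j}
  &= \sum_{j=0}^{p-1} \binom{p-1}{j} - \binom{p-1}{0}
   = 2^{\,p-1} - 1, \\
p = 5:\quad  & 2^{4} - 1 = 15, \\
p = 18:\quad & 2^{17} - 1 = 131{,}071 .
\end{align*}
```

Each additional candidate variable therefore roughly doubles the
number of prediction equations an exhaustive search must solve.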

It is interesting to note that when all four variables are in
regression the values of the regression coefficients do not suggest
which variables might be best to eliminate from regression. In fact,
none of the values are close enough to zero to indicate that any
should be removed.

In this chapter it has been shown that we can expect the amount
of reduction in the conditional variance of $y_p$ to be less per variable
added to regression when the number of variables in the optimal
combination becomes larger. Thus, if one were willing to state in advance
his maximum allowable value of the conditional variance of $y_p$, the
problem would be a straightforward one of searching table II for the
minimum number of variables producing that conditional variance or
less. We now restate this same problem in the above terms:

"To find some satisfactorily small number of variables,
q, ($q \le p-1$), that, when used to predict $y_p$, reduces $\sigma_{pp.1,\dots,q}$
to some satisfactorily small fraction of the unconditional variance,
$\sigma_{pp}$."

Chapter VI 

THE STEP-WISE PROCEDURE 

We now discuss an alternate procedure of searching for optimal
combinations of variables in regression, called the step-wise
procedure. This procedure has the advantage of reducing the number of
prediction equations to be solved from $\sum_{r=1}^{p-1} \binom{p-1}{r}$, as in chapter V,
to p-1 or less, thus keeping the number of computations to within
the capability of today's high speed electronic computers. We shall
see that the combination of variables selected by this method is not
always optimal, i.e., it is possible that a different set of the same
number of variables might yield a more accurate prediction equation
for $y_p$. However, practical experience indicates that sets decidedly
better than those discovered by the procedure outlined in this chapter
are rare ([?], page 19). We shall discuss additional problems
encountered when the step-wise procedure is applied to a sample. The
need for statistical tests at each step is demonstrated and an actual
test is developed.

The step-wise procedure is as follows: At each step every variable
not yet in regression is examined to see how much the conditional
variance of $y_p$ would be decreased if it, alone, were added to the
variables already in regression, i.e., assuming q variables are already
in regression, the quantity $\sigma_{pp\cdot 1,\ldots,q} - \sigma_{pp\cdot 1,\ldots,q,m}$ is computed
for each variable, $y_m$, still not in regression. The variable to be
added to regression is $y_k$, the variable for which this computation is
greatest; i.e., $y_k$ is chosen from the variables not in regression, $y_m$,
so that

$$\sigma_{pp\cdot 1,\ldots,q} - \sigma_{pp\cdot 1,\ldots,q,k} = \max_m \left[ \sigma_{pp\cdot 1,\ldots,q} - \sigma_{pp\cdot 1,\ldots,q,m} \right]$$

or equivalently, such that:

$$\sigma_{pp\cdot 1,\ldots,q,k} = \min_m \left[ \sigma_{pp\cdot 1,\ldots,q,m} \right].$$
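The selection rule just stated can be sketched in a few lines. The following is an illustration only, not the NELIAC program described later: it assumes the conditional variance is obtained as the variance of the predicted variable less the part explained by the variables in regression, and the 4 variate V-C matrix used is hypothetical (it is not the matrix of equation 4.2). All function names are ours.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def conditional_variance(cov, target, used):
    """Variance of variable `target` given the variables listed in `used`."""
    if not used:
        return cov[target][target]
    sxx = [[cov[i][j] for j in used] for i in used]
    sxy = [cov[i][target] for i in used]
    beta = solve(sxx, sxy)
    return cov[target][target] - sum(b * s for b, s in zip(beta, sxy))


def forward_stepwise(cov, target):
    """At each step add the variable giving the smallest conditional variance."""
    remaining = [i for i in range(len(cov)) if i != target]
    used, history = [], []
    while remaining:
        best = min(remaining,
                   key=lambda m: conditional_variance(cov, target, used + [m]))
        used.append(best)
        remaining.remove(best)
        history.append((best, conditional_variance(cov, target, used)))
    return history


# Hypothetical V-C matrix; the last variable (index 3) plays the role of y_p.
cov = [[4.0, 1.0, 0.0, 2.0],
       [1.0, 3.0, 1.0, 1.0],
       [0.0, 1.0, 2.0, 1.0],
       [2.0, 1.0, 1.0, 5.0]]
history = forward_stepwise(cov, target=3)
for var, sigma in history:
    print(f"added variable index {var}: conditional variance {sigma:.4f}")
```

Note that only p-1 steps are taken, one per predictor, in contrast to the exhaustive search of chapter V.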

We illustrate this procedure by applying it to the p variate
normal specified by equations 4.1 and 4.2. This illustration can
be followed most easily if reference is made to table II of
chapter V:

Step I:   Compute all four conditional variances of group 1,
          and choose the smallest value (73.6553).
          action:  add variable $y_4$ to regression
          results: variables in regression: $y_4$
                   $\sigma_{55\cdot 4} = 73.6553$

Step II:  Compute the conditional variances of group 2 that
          include variable $y_4$ in regression, and choose the
          smallest value (6.2303).
          action:  add variable $y_1$ to regression
          results: variables in regression: $y_1$, $y_4$
                   $\sigma_{55\cdot 1,4} = 6.2303$

Step III: Compute the conditional variances of group 3 that
          include variables $y_1$ and $y_4$ in regression, and
          choose the smallest value (3.9982).
          action:  add variable $y_2$ to regression
          results: variables in regression: $y_1$, $y_2$, $y_4$
                   $\sigma_{55\cdot 1,2,4} = 3.9982$

Step IV:  Add the last variable.
          results: variables in regression: $y_1$, $y_2$, $y_3$, $y_4$
                   $\sigma_{55\cdot 1,2,3,4} = 3.9891$

As in the preceding chapter, we immediately see that most of
the conditional variance of $y_5$ can be eliminated by using only two
of the possible four variables in regression. However, this time
the pair chosen were variables $y_1$ and $y_4$ instead of $y_1$ and $y_2$,
producing a conditional variance of 6.2303 instead of 4.8261.

The step-wise procedure is equally applicable to analysis of
a sample of size n. In this case all information is obtained from
the sample vector, Z, and sample V-C matrix, S. In particular,
the values of the sample conditional variances, $s_{pp\cdot 1,\ldots,q}$, rather
than $\sigma_{pp\cdot 1,\ldots,q}$, are used at each step to determine the next
variable to enter regression. As before, p-1 prediction equations,
and associated estimated conditional variances of $y_p$, $s_{pp\cdot 1,\ldots,q}$,
can be obtained. Each succeeding equation will contain one more
variable in regression, and usually¹ will have a smaller value of
$s_{pp\cdot 1,\ldots,q}$. Now, as in chapter V, the most acceptable combination
of variables in regression, for which the estimated conditional
variance of $y_p$ is small enough, can be chosen.

1. The exception can occur when the sample size, n, is small. See
Hald's example table 20.6, where n = 13. [H]

At this point we must consider a problem that is ever present
whenever a sample is used as a source of information. In the present
case the problem is stated as follows: How do we know that the sample
size, n, was large enough, so that the conditional variance associated
with the combination we have just selected is accurate? (We will
always assume that n is greater than p.)

Intuitively, if n is just a little larger than p we should not
have much confidence in sample vector Z and sample V-C matrix S,
nor in the estimated regression coefficients or conditional variances
of $y_p$. In fact we shouldn't be surprised if a second sample of the
same size were to produce a completely different set of variables when
the same step-wise procedures are used. On the other hand, as n
approaches infinity the sample values Z and S approach the true values
U and Σ. It is clear that at each step, each variable that is a
candidate to enter regression should be given a statistical test of
some kind.

Suppose q variables, $y_1,\ldots,y_q$, are already in regression with
estimated conditional variance of $y_p$ given by $s_{pp\cdot 1,\ldots,q}$, and suppose
that we are considering variable $y_k$ for addition to regression. It
can be shown that if actually $\sigma_{pp\cdot 1,\ldots,q,k} = \sigma_{pp\cdot 1,\ldots,q}$, then the
statistic

$$F = \frac{(n-q-1)\, s_{pp\cdot 1,\ldots,q} - (n-q-2)\, s_{pp\cdot 1,\ldots,q,k}}{s_{pp\cdot 1,\ldots,q,k}} \tag{6.1}$$

has the F distribution with 1 and n-q-1 degrees of freedom £l|J,
section 6.1;. Furthermore, statistic F will tend to be greater than
$F_{\alpha}(1, n-q-1)$ if $\sigma_{pp\cdot 1,\ldots,q,k}$ is actually less than $\sigma_{pp\cdot 1,\ldots,q}$.
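As a sketch only, statistic (6.1) can be computed from the two estimated conditional variances as below. We assume here that the statistic has the form $[(n-q-1)\,s_{pp\cdot 1,\ldots,q} - (n-q-2)\,s_{pp\cdot 1,\ldots,q,k}] / s_{pp\cdot 1,\ldots,q,k}$; the function name and the numbers in the example are hypothetical.

```python
def f_to_enter(s_q: float, s_qk: float, n: int, q: int) -> float:
    """F statistic for adding y_k when q variables are already in regression.

    s_q  : estimated conditional variance with the q variables in regression
    s_qk : estimated conditional variance after y_k is also added
    n    : sample size
    """
    if n - q - 2 <= 0:
        raise ValueError("sample too small: need n > q + 2")
    return ((n - q - 1) * s_q - (n - q - 2) * s_qk) / s_qk


# Hypothetical numbers: n = 22 observations, q = 1 variable already in regression.
f = f_to_enter(s_q=10.0, s_qk=8.0, n=22, q=1)
print(f)  # 6.0
```

The computed value would then be compared with the tabled critical value of the F distribution at the chosen level.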

We immediately encounter a new complication: The above statistic
F behaves as stated above so long as variable $y_k$ is studied by itself.
However, the selection of $y_k$ from among those variables still not in
regression was not completely at random; $y_k$ was chosen at this time
because it was estimated to be the "best" variable to add at this
step. In other words, we are in effect computing F for a number of
variables and choosing the variable for which F is the largest. It
is important to realize that due to this method of selection, the F
statistic used with the selected variable $y_k$ will tend to be larger
than would be expected on the average if variable $y_k$ were to be studied
as an individual variable alone. Intuitively, this effect should be
stronger with the first variables added to regression, since those
variables for which F is large due to randomness are removed early
from those not in regression. Suggested procedures for compensating
for this are discussed in a later chapter.
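The size of this selection effect can be suggested by a small sampling experiment. The sketch below is a simplification and not a study reported in this paper: it assumes that none of the candidates actually reduces the conditional variance and, unlike a real regression, that their F statistics are independent draws from F(1, 49). The critical value 4.03 is the .05 level value for (1, 49) degrees of freedom; the function names, seed and number of trials are arbitrary choices of ours.

```python
import random


def null_f(df2: int, rng: random.Random) -> float:
    """One draw from the F(1, df2) distribution: Z^2 / (chi-square_df2 / df2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df2 / 2.0, 2.0)  # chi-square with df2 degrees of freedom
    return z * z / (chi2 / df2)


def pass_rate(candidates: int, df2: int, critical_f: float,
              trials: int, seed: int = 1) -> float:
    """Fraction of trials in which the largest of `candidates` null F values passes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        largest = max(null_f(df2, rng) for _ in range(candidates))
        if largest > critical_f:
            hits += 1
    return hits / trials


alone = pass_rate(candidates=1, df2=49, critical_f=4.03, trials=4000)
best_of_ten = pass_rate(candidates=10, df2=49, critical_f=4.03, trials=4000)
print(alone, best_of_ten)
```

Under these assumptions a single variable passes in roughly one trial in twenty, while the largest of ten passes far more often, which is the effect described above.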

Let α be the probability of erroneously concluding that
$\sigma_{pp\cdot 1,\ldots,q,k}$ is less than $\sigma_{pp\cdot 1,\ldots,q}$ whenever actually they are
equal (α is usually chosen to be .05). This error is usually called
the type one error. Suppose now, at each step we compute the
statistic F of formula 6.1, and compare with the value of
$F_{\alpha}(1, n-q-1)$ which can be found in tables of the F distribution.
If the sample size is too small the power of the test will be low.
This means that the actual difference between $\sigma_{pp\cdot 1,\ldots,q}$ and
$\sigma_{pp\cdot 1,\ldots,q,k}$ can be substantial and still, the probability that
the computed statistic, F, will exceed $F_{\alpha}(1, n-q-1)$ can be small.
(Of course, this probability will always be greater than α.) Failing
to detect an actual difference in this way is usually called the
type two error.

On the other hand, given that $\sigma_{pp\cdot 1,\ldots,q}$ is actually greater
than $\sigma_{pp\cdot 1,\ldots,q,k}$ (the only alternative being that they are equal),
regardless of how small the actual difference, the probability that
statistic F exceeds $F_{\alpha}(1, n-q-1)$ can be made as close to one as we
please by increasing sample size, n, indefinitely.

Meanwhile, among those variables for which $\sigma_{pp\cdot 1,\ldots,q}$ is
actually equal to $\sigma_{pp\cdot 1,\ldots,q,k}$, approximately α x 100 percent
are expected to "pass" the F test (i.e., $F > F_{\alpha}(1, n-q-1)$),
independently of the sample size, n.

We have just seen that the two important factors that affect the
probability that variable $y_k$ will pass a particular F test are the
amount by which the actual values of $\sigma_{pp\cdot 1,\ldots,q}$ and $\sigma_{pp\cdot 1,\ldots,q,k}$
differ, and the size of the sample, n. Thus, the decision rule we
might use is to terminate the step-wise procedure at any step at which
all variables still not in regression fail to pass the F test.

With this decision rule, the F test will limit the variables in
regression to those whose contribution to reduction in conditional
variance of $y_p$ appears to be large enough for the given sample to
measure.

In the next chapter we shall consider additional halt criteria
which an experimenter may wish to impose on the step-wise process.

Chapter VII

AUTOMATIC REGRESSION ANALYSIS - CRITERIA
FOR HALTING STEP-WISE REGRESSION

In this chapter we shall develop useful procedures for conducting
automatic regression analysis on a sample of size n of a p
variate normal using a high speed electronic computer. Efroymson
[3] has developed an algorithm very suitable for computer use in
which any single variable can be added to, or eliminated from
regression (depending upon its former status). At any step the
regression coefficients, conditional variance of $y_p$, multiple
correlation coefficient of $y_p$ on the variables in regression, and many
other desirable parameters can be computed easily and printed out.
Useful criteria for halting the regression process are discussed
and developed.

Given a sample of size n of a p variate normal, formulas for 
computing vector Z and matrix S have already been described. Also, 
basically we shall use the step-wise procedure of adding variables 
to regression. The most important remaining problem is to consider 
how the user of an automatic regression analysis computer program 
can specify in advance of the computer run, reasonable criteria for 
halting the step-wise procedure. 

So far, it appears that a satisfactory criterion for stopping
the regression process has never been fully developed to suit
automatic step-wise regression. Miller [7] proposes adding variables
until the F test fails. He also proposes a method of adjusting the
level for which the critical F is chosen (1 - α in chapter VI) to
compensate for the fact that the method of choosing each variable
to enter regression is not a random choice:

    In order to derive a test for the statistical significance
    of $X_j$, the following analysis may be performed: When a
    predictor is chosen at random from a group of predictors,
    an F test is performed where the critical F is usually
    taken at the 95% level. This allows for a one in twenty
    chance for considering this predictor significant when in
    fact it is not. In the screening procedure the selection
    of $X_j$ is not a random choice. Therefore, it is necessary
    to determine at what probability level the critical F
    should be taken while still specifying a one in twenty
    chance occurrence.

    For the screening procedure it appears proper to make the
    level for which the critical F is chosen a function of the
    number of possible predictors, n. The ordinary 95% level
    F can be expressed as

    $$F_{.95} = F(1 - 1/20),$$

    and for the screening procedure the 95% level is

    $$F^{*}_{.95} = F(1 - 1/(20 \cdot n)).$$

Intuitively, Miller's solution seems to be somewhat extreme.
For example, if p = 51 (and α = .05) then at the first step the
level chosen for the critical F, α', would be computed as follows:

$$1 - \alpha' = 1 - \frac{1}{20 \times 50} = .999; \qquad \alpha' = 1 - .999 = .001,$$

so that the value used for comparison would be $F_{.001}(1,49) = 12.2$,
rather than $F_{.05}(1,49) = 4.03$ when no adjustment is made. In this
case the critical F value is arbitrarily tripled only because there
are 50 variables still not in regression. Granted that the critical
F should be adjusted upward in order to maintain a "one in twenty
chance occurrence", it would seem that due to lack of information
as to the extent of this non-random effect, one should make such an
adjustment more conservatively than this.
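The arithmetic of Miller's adjustment can be sketched as follows. The function name is ours, and we assume that the divisor is the number of predictors that are candidates at the step in question, as in the example above.

```python
def miller_level(alpha: float, candidates: int) -> float:
    """Adjusted level alpha' = alpha / (number of candidate predictors)."""
    if candidates < 1:
        raise ValueError("need at least one candidate predictor")
    return alpha / candidates


# p = 51, so 50 predictors are candidates at the first step.
alpha_adj = miller_level(0.05, 50)
print(round(alpha_adj, 6), round(1 - alpha_adj, 6))  # 0.001 0.999
```

The critical F itself must still be looked up in tables of the F distribution at the adjusted level.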

Perhaps a satisfactory "hedge" might be to use the adjusted
level:

$$\alpha' = \frac{\alpha}{\log p} \qquad \text{or} \qquad \alpha' = \frac{\alpha}{\log p}\, K,$$

where K is a constant inserted by the program user for his
particular sample.

Conceivably, one might wish to make no adjustment at all for
this effect because the consequences of increasing the type two
error during the early steps are so detrimental to the step-wise
procedure.

Efroymson proposes two F tests at each step. His program
first compares each variable, $y_i$, currently in regression with an
appropriate "min F" critical value to see if it still passes the
F test of significance. If a variable that no longer passes is
discovered, the action at that step is to remove the variable from
regression. By setting min F to a value slightly less than the
standard critical value used for adding variables, the possibility
of creating an endless loop is avoided.

This feature is appealing because new combinations in regression
obtained in this manner are always more nearly optimal (as far as
the sample is concerned) than was the preceding combination of the
same size; yet the number of computer instructions required to do
this operation is minimal.

In chapter VI it was shown that the choice of $F_{\alpha}(1, n-q-1)$ as
the value of critical F is made in an attempt to limit the variables
in regression to those whose contribution to reduction in conditional
variance of $y_p$ is large enough for the given sample to measure. It
is clear that Efroymson's double F test contributes to this effort
by ensuring that all variables in regression continue to pass the
F test even after subsequent variables have been added.
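The double F test can be sketched as a loop that first checks the variables in regression against min F and only then considers adding a variable. The sketch below is an illustration under stated assumptions, not Efroymson's algorithm as programmed: it recomputes residual sums of squares directly from a sample V-C matrix (divisor n-1) instead of updating a compact matrix, it uses the usual ratio of the change in residual sum of squares to the residual mean square as the F value, it assumes n is greater than p, and the V-C matrix and critical values in the example are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def rss(cov, target, used, n):
    """Residual sum of squares of variable `target` regressed on `used`."""
    resid = cov[target][target]
    if used:
        sxx = [[cov[i][j] for j in used] for i in used]
        sxy = [cov[i][target] for i in used]
        resid -= sum(b * s for b, s in zip(solve(sxx, sxy), sxy))
    return (n - 1) * resid


def double_f_stepwise(cov, target, n, f_enter, f_remove):
    """Return the variables in regression when no variable can be added or removed."""
    if f_remove >= f_enter:
        raise ValueError("min F must be less than the critical F to avoid an endless loop")
    used = []
    while True:
        df = n - len(used) - 1                 # residual degrees of freedom now
        current = rss(cov, target, used, n)
        if used:
            # F each variable in regression would have if it alone were removed.
            f_out = {j: (rss(cov, target, [i for i in used if i != j], n) - current)
                        / (current / df) for j in used}
            worst = min(f_out, key=f_out.get)
            if f_out[worst] < f_remove:
                used.remove(worst)
                continue
        # F each variable not in regression would have if it alone were added.
        f_in = {}
        for k in range(len(cov)):
            if k != target and k not in used:
                new = rss(cov, target, used + [k], n)
                f_in[k] = (current - new) / (new / (df - 1))
        if not f_in:
            return used
        best = max(f_in, key=f_in.get)
        if f_in[best] <= f_enter:
            return used
        used.append(best)


# Hypothetical sample V-C matrix; index 3 plays the role of y_p; n = 30.
cov = [[4.0, 1.0, 0.0, 2.0],
       [1.0, 3.0, 1.0, 1.0],
       [0.0, 1.0, 2.0, 1.0],
       [2.0, 1.0, 1.0, 5.0]]
loose = double_f_stepwise(cov, target=3, n=30, f_enter=3.5, f_remove=3.0)
strict = double_f_stepwise(cov, target=3, n=30, f_enter=4.0, f_remove=3.9)
print(loose, strict)
```

With the looser pair of critical values two variables enter; with the stricter pair the second variable fails the entry test and the procedure halts after one step.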

It is impossible to anticipate here all uses for different
combinations of specified stopping criteria. We already have seen
that statisticians so far have only provided general guidelines
in this area. This is mainly because each individual p variate
normal distribution has its own set of complications, and for each
computer run on a given sample the experimenter may have varying
amounts of prior information regarding the p variate normal he is
studying. Thus, for any automatic regression analysis computer
program it is important that the user of the program be able to
specify halt criteria with as much flexibility as possible.

Perhaps the most important aspect of each halt criterion is
that it must be specifiable in a manner most meaningful to the
experimenter. For example, an experimenter under certain conditions
may not look upon the F test of chapter VI as being useful to him at
all. Quite likely, he may wish to replace $F_{\alpha}(1, n-q-1)$ with a value,
say $\lambda$, to be the critical amount of reduction of the variance of $y_p$
as a stopping rule; or, he may want to specify both critical values.
Although $\lambda$ and $F_{\alpha}(1, n-q-1)$ are in different units, it is clear that
the $\lambda$ test is equivalent to an F test, so that in specifying both
tests the experimenter is merely having the computer apply whichever
test is the most stringent at each step.

The following example illustrates most of the points covered
in the last few paragraphs. We show here how a suitable choice of
min F and critical F, artificially chosen, can aid Efroymson's
double F test procedure to find a more nearly optimal combination
of variables in regression than already obtained by the step-wise
procedure at a previous step. To do this we take an example worked
out by Hald [6], section 20.3. In this example, Hald used data
from a sample of size 13 of a five variate distribution which we
will assume here to be normal. The sample vector, Z, and sample
V-C matrix, S, are the same as those shown by equations 4.1 and
4.2 in this paper, which in chapter IV were used to define U and
Σ respectively. In the following illustration we shall consider
4.1 and 4.2 to be computed Z and S as in Hald's example.

From Hald's example we compute the F statistic (6.1) for each
variable in regression and not in regression at each step. See
table IV below. For variables in regression, $y_k$, the value
computed is the F statistic that would be computed for $y_k$ if it, alone,
were removed from regression first. These values pertaining to
variables currently in regression are underscored in table IV. In
order to illustrate the above points, the F test using $F_{\alpha}(1, n-q-1)$
was eliminated.

We now choose the artificial values of critical F and min F to
be 3.5 and 3.0 respectively. With this choice we shall obtain the
optimal combination of variables $y_1$ and $y_2$, where the regular
forward step-wise procedure yielded variables $y_1$ and $y_4$ in
chapter VI.

                               Table IV

        Variables in Regr.          Computed F Statistic (6.1)
Step    Before This Step        y1          y2          y3          y4

  1     none                   12.60       21.96        4.40      (22.80)
  2     y4                   (108.16)        .17       40.30      _22.80_
  3     y1, y4               _108.16_      (5.03)       4.24     _159.21_
  4     y1, y2, y4           _154.02_      _5.03_        .01      _00 o_
  5     y1, y2               the optimal combination for two variables
                             in regression

(Values between underscores pertain to variables currently in
regression; values in parentheses pertain to the variable added at
that step.)



The variables added to or eliminated from regression were chosen
according to Efroymson's double F test procedure. Recall that no
variable was to be added at any given step if the F value of one of
the variables already in regression got below 3.0 (min F). Hence
at step 4, variable $y_4$ was eliminated, yielding the optimal
combination $y_1$, $y_2$. At each previous step, the variable added (whose value
is enclosed in parentheses) was chosen because its F statistic was
the largest among those still not in regression and was also greater
than the critical F value which was artificially chosen to be 3.5.

This example illustrates some complexities that arise during
the regression process that are still not completely explainable
analytically. For instance, the relative values of statistic F
changed drastically as the combination of variables in regression
changed. These values correspond to relative amounts of reduction
of conditional variance that would be due to the corresponding
variable if it were (or is) in regression. Thus, when $y_4$ was added
to regression at the first step the relative contribution in variance
reduction due to $y_1$ jumped from 13 to 108, implying that $y_1$ and $y_4$
are much more powerful together than their sum when each is used
alone.



This example also suggests reasons why an experimenter may wish
to specify critical F values artificially, especially if results of
prior computer runs are available.

It was suggested earlier that instead of keeping track of
computed values of (6.1) requiring specification of artificial critical
F's on the part of the experimenter, it might be simpler for him to
keep track of actual amounts of variance reduction of $y_p$ and make up
artificial values of $\lambda$ in units of variance reduction of $y_p$. Also
it is clear that the experimenter may wish to specify a value, say,
min $\lambda$ < $\lambda$, which would become the critical amount of variance
reduction required of each variable in regression, in order to stay in
regression.

The following summary lists a few useful halt criteria which the
experimenter may wish to specify before the automatic regression
analysis is performed on a given sample. The automatic regression
program should permit the experimenter to specify any combination
of these criteria for any given computer run:

1. $F_{\alpha}(1, n-q-1)$ and min F (Chapter VI)

2. $\lambda$ and min $\lambda$ (defined above)

3. Stop when the conditional variance of $y_p$ gets as low as
   V percent of the original variance, $s_{pp}$.

4. Stop when the conditional variance of $y_p$ gets as low as T.

5. Stop when W variables have been added to regression.
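One way such a combination of criteria might be represented is sketched below. This is our own illustration, not the input format of the program described in the next chapter: the class and field names are hypothetical, a criterion left as None is taken to be not in force, and the two F tests are left to the step-wise procedure itself. The first two checks use figures from the sample run printed in chapter VIII (halt values 4, 4.0 and 2.0; conditional variances 199.0421, 66.7210 and 5.8889); the value 3.9 in the last check is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HaltCriteria:
    """Optional stopping rules; None means the rule is not in force."""
    min_reduction: Optional[float] = None        # critical variance reduction per step
    percent_of_original: Optional[float] = None  # V
    target_variance: Optional[float] = None      # T
    max_variables: Optional[int] = None          # W


def halt_reason(c, original_var, previous_var, current_var, n_in_regression):
    """Return a description of the first criterion met, or None to continue."""
    if c.max_variables is not None and n_in_regression >= c.max_variables:
        return "number of variables in regression reached W"
    if c.target_variance is not None and current_var <= c.target_variance:
        return "conditional variance reached T"
    if (c.percent_of_original is not None
            and current_var <= original_var * c.percent_of_original / 100.0):
        return "conditional variance reached V percent of original"
    if c.min_reduction is not None and previous_var - current_var < c.min_reduction:
        return "last variable reduced variance by less than the critical amount"
    return None


criteria = HaltCriteria(min_reduction=2.0, target_variance=4.0, max_variables=4)
after_step_1 = halt_reason(criteria, 199.0421, 199.0421, 66.7210, 1)
after_step_2 = halt_reason(criteria, 199.0421, 66.7210, 5.8889, 2)
hypothetical = halt_reason(criteria, 199.0421, 5.8889, 3.9, 3)
print(after_step_1, after_step_2, hypothetical)
```

Keeping each criterion optional lets the experimenter specify any combination for a given run.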

In chapter IX we shall propose a procedure using some of the 
above halt criteria in searching for an optimal combination of vari- 
ables in regression. 

In chapter V it was stated that one good reason for reducing
the number of variables in regression might be to reduce the cost
of observing the variables from which each future prediction of $y_p$
is to be computed. Often some of the variables cost considerably
more to observe than others, and the experimenter may not be so
interested in reducing the total number of variables to observe,
as he is in reducing the total cost to observe the values of the
variables in regression for each prediction of $y_p$ to be made later.
Thus, it is desirable that the experimenter be able to specify
observation costs, $c_j$, (say, in dollars) for each "independent"
variable $y_1,\ldots,y_{p-1}$, and have the automatic regression analysis operation
reflect these costs when selecting variables to go into regression.

The "cost option" should differ from the regular option only in
the criterion used at each step to determine which variable is to be
added to regression. Recall that the regular option calls for choosing
the variable that will reduce the variance of $y_p$ the most, to be the
variable added.

In the cost option, at each step, those variables still not in
regression are determined. Then, instead of $y_j$ for which $\sigma_{pp\cdot 1,\ldots,q,j}$
is least (as estimated by $s_{pp\cdot 1,\ldots,q,j}$), $y_j$ is chosen for which
$c_j / (s_{pp\cdot 1,\ldots,q} - s_{pp\cdot 1,\ldots,q,j})$ is least; i.e., $y_j$ is chosen on
the basis that it is cheapest in terms of "dollars" to observe per
unit of variance reduction of $y_p$ due to adding $y_j$. It is clear
that the standard option is just a special case of the cost option
in which the observation costs are all specified to be equal.
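The difference between the two options can be sketched as below. The function name is ours. The costs, the sample variance of $y_5$ and the correlations with $y_5$ are taken from the sample run printed in chapter VIII (with signs as we read them from the printout); as an assumption for this first-step illustration, the variance reduction due to each variable alone is approximated by the sample variance times the squared correlation.

```python
def choose_variable(costs, reductions, cost_option):
    """Index of the variable to add.

    costs[j]      : cost c_j of observing candidate j
    reductions[j] : reduction in conditional variance of y_p if j alone is added
    """
    candidates = [j for j, r in enumerate(reductions) if r > 0.0]
    if not candidates:
        raise ValueError("no candidate reduces the conditional variance")
    if cost_option:
        # cheapest in dollars per unit of variance reduction
        return min(candidates, key=lambda j: costs[j] / reductions[j])
    # regular option: greatest variance reduction
    return max(candidates, key=lambda j: reductions[j])


costs = [10.0, 12.0, 16.0, 20.0]                  # dollars, Y1..Y4
s_pp = 199.0421                                   # sample variance of Y5
corr_with_y5 = [0.6981, 0.8070, -0.4515, -0.8160]
reductions = [s_pp * r * r for r in corr_with_y5]

best = choose_variable(costs, reductions, cost_option=False)
cheapest = choose_variable(costs, reductions, cost_option=True)
print(f"best: Y{best + 1}, cheapest: Y{cheapest + 1}")
```

With equal costs the two choices coincide, which is the special case noted above.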

Since optimality is now measured in terms of minimum cost to 
observe per unit of variance reduction instead of maximum variance 
reduction, the program user must be able to specify a halt criterion 
so that whenever the cost to observe a variable in regression be- 
comes greater than, say, max C, the program will remove it, and 
whenever all variables still not in regression would cost more than, 
say, C dollars per unit of variance reduction, if added, the program 
should halt regression. Now, min $\lambda$ and $\lambda$ are not needed as
halting criteria for the cost option. However, the experimenter should
still have the option of including other halt criteria summarized
above.

To summarize, neither Miller's nor Efroymson's stopping rules
are optimal. Both basically use only the statistical F test of
chapter VI as a decision rule for halting. It has been illustrated
here that additional decision criteria that can be specified by the
experimenter in terms more meaningful to him, may greatly facilitate
his search for optimal combinations of variables in regression.

Chapter VIII

THE MV REGRESSION AND MV SIM COMPUTER PROGRAMS

The purpose of this chapter is to describe a computer program,
called MV REGRESSION, which performs automatic regression analysis
on a sample of size n. Also briefly outlined is program MV SIM
which generates samples of size n from a specified p variate normal.
The detailed operation of MV SIM is described in appendix A. Both
programs are written in NELIAC compiler language. Operation of
these programs on the Control Data Corporation model 1604 computer
at the U. S. Naval Postgraduate School has produced all of the
computations involved in the examples throughout this paper as well as
the test results discussed in chapter IX and appendix B.

Briefly, MV SIM will analyze a specified p variate normal (given
by U and Σ) and print out true regression coefficients and associated
$\sigma_{pp\cdot 1,\ldots,q}$ for any set(s) of q variables specified by the program
user (q ≤ p-1). Next, MV SIM will generate a sample of size n from
the specified p variate normal and compute sample vector Z and sample
V-C matrix, S. Before turning control to program MV REGRESSION, MV
SIM performs statistical tests on Z and S, and prints out results of
these tests, but takes no action based on these results. These
statistical tests and actual computer run results are discussed in
detail in appendix B.
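For readers who wish to experiment, the sampling step can be sketched as below. This is not a description of MV SIM, whose method is given in appendix A; it is one standard technique, assumed here for illustration, in which independent standard normal deviates are transformed by a triangular factor of the V-C matrix. The function names, the two variate mean vector and the V-C matrix in the example are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower triangular L with L L' = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_mvn(mean, cov, n, seed=0):
    """Draw n observations from the normal distribution with given mean and V-C matrix."""
    rng = random.Random(seed)
    low = cholesky(cov)
    p = len(mean)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        rows.append([mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                     for i in range(p)])
    return rows


def sample_mean_and_cov(rows):
    """Sample mean vector and sample V-C matrix (divisor n-1)."""
    n, p = len(rows), len(rows[0])
    z = [sum(r[i] for r in rows) / n for i in range(p)]
    s = [[sum((r[i] - z[i]) * (r[j] - z[j]) for r in rows) / (n - 1)
          for j in range(p)] for i in range(p)]
    return z, s


u = [1.0, -2.0]
sigma = [[4.0, 1.2], [1.2, 1.0]]
z, s = sample_mean_and_cov(sample_mvn(u, sigma, n=5000, seed=1))
print(z)
print(s)
```

For a large sample the computed mean vector and V-C matrix should lie close to the specified values, which is the kind of check made on Z and S before regression begins.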

Before proceeding with a description of MV REGRESSION, it is
interesting to consider the powerful research tool one has when he
can specify a p variate normal (U and Σ) and quickly generate
random samples from that distribution. It is obvious that this
operation saves much time in gathering data, or in "making up"
reasonable samples when it is desired to test the operation of a
regression analysis program such as MV REGRESSION. (This was the
case when computations for the examples in this paper were required.)
But MV SIM offers the statistician a much more useful research
capability than this. Using MV SIM one can make accurate
comparisons of the results of any regression scheme with true regression
equations, conditional variances, etc., which MV SIM computes from
the specified U and Σ. Of course, for such a comparison, the
regression scheme must be applied to a sample drawn by MV SIM from
the distribution specified by U and Σ.

The sampling capability of MV SIM also makes it possible to
perform empirical sampling studies of random variables whose
distributions are difficult to find theoretically. One such study,
now in progress, is discussed in chapter IX.

We finish this chapter with a detailed description of program 
MV REGRESSION. 

The inputs of MV REGRESSION are as follows:

1. Start with a sample of n observations of the p variate
   normal. If MV SIM supplies the sample, it will supply
   it in the form of Z and S.

2. Specify "standard" or "cost" option (see chapter VII).
   If cost option, give cost of observation, $c_i$, for
   variables $y_i$, for i = 1,...,p-1. If the user specifies
   "standard", he still may specify costs and obtain
   printed cost data even though the "regular" criterion is
   used as far as entering variables into regression is
   concerned.

3. Specify criteria for halting regression of a sample:

   A. 1) $F_{\alpha}(1, n-q-1)$, the value to be compared with statistic
         F for adding variables to regression.

      2) Min F, a value less than $F_{\alpha}(1, n-q-1)$, to be compared
         with statistic F for removing variables from
         regression.

   B. 1) Last variable added reduced the conditional
         variance of $y_p$ by less than $\lambda$ (not used for
         cost option).

      2) Last variable added, $y_k$, costs more than C
         dollars to observe per unit of variance reduction
         of $y_p$ due to adding $y_k$ (used only for cost option).

   C. Conditional variance of $y_p$ became less than T.

   D. Number of variables in regression reached W.

Before step I of the regression operation, MV REGRESSION prints
out $s_{pp}$, and (optionally) the RR matrix. (The RR matrix is a p x p
matrix which contains all current data in compact form from which
all required parameters at each step can be computed. Initially,
it is a matrix of sample correlation coefficients which is easily
computed from sample V-C matrix S. See Efroymson [3].)
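The way such a compact matrix can be updated at each step is sketched below. This is our own illustration of a pivoting ("sweep") update of the kind used in Efroymson's algorithm, not a transcription of MV REGRESSION; the sign convention shown is one we assume, chosen because it agrees with the matrices printed later in this chapter. The starting matrix is the correlation matrix of the sample run ("MATRIX TO START"), with signs as we read them from the printout.

```python
def sweep(a, k):
    """Return the matrix after pivoting on diagonal element k.

    Applying the same operation twice on the same k restores the matrix,
    so one routine serves for both adding and removing variable k.
    """
    n = len(a)
    piv = a[k][k]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = 1.0 / piv
            elif i == k:
                out[i][j] = a[k][j] / piv
            elif j == k:
                out[i][j] = -a[i][k] / piv
            else:
                out[i][j] = a[i][j] - a[i][k] * a[k][j] / piv
    return out


r = [[ 1.0000,  0.1726, -0.8116, -0.1915,  0.6981],
     [ 0.1726,  1.0000, -0.0415, -0.9703,  0.8070],
     [-0.8116, -0.0415,  1.0000, -0.0734, -0.4515],
     [-0.1915, -0.9703, -0.0734,  1.0000, -0.8160],
     [ 0.6981,  0.8070, -0.4515, -0.8160,  1.0000]]

step1 = sweep(r, 3)      # Y4 enters regression
step2 = sweep(step1, 0)  # Y1 enters regression
print(round(step1[4][4], 4), round(step2[4][4], 4))
```

The lower right element after each step is the fraction of the variance of $y_5$ not yet accounted for; it agrees to about three decimals with the values .3340 and .0293 printed at steps 1 and 2 of the sample run, the small differences being due to the four-figure starting values.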

At each step, after a variable has been added to regression,
the following data is printed:

I.   a. "Best" variable to have been added (variable with the
        greatest reduction of the conditional variance of $y_p$).

     b. "Cheapest" variable to have been added.

     c. Whichever of the two variables above that actually
        was added (a. if regular option, b. if cost option).

II.  The value used in the F test for the added (or removed)
     variable. MV REGRESSION compares this value with the
     input value of $F_{\alpha}(1, n-q-1)$ (or min F).

III. a. The square of the estimated new multiple correlation
        coefficient of $y_p$ on the variables in regression.

     b. The estimate of the new conditional variance of
        $y_p$, $s_{pp\cdot 1,\ldots,q}$.

IV.  The cost to observe the variable just added, $y_k$, per unit
     of conditional variance reduction due to the addition of
     this variable to regression at this time. This is
     computed as $c_k / (s_{pp\cdot 1,\ldots,q} - s_{pp\cdot 1,\ldots,q,k})$.

V.   a. A list of the new set of variables, $y_i$, (i = 1,...,q)
        in regression.

     b. The estimated regression coefficients, $b_i$.

VI.  The cost to observe the new set of variables in regression
     per unit of total variance reduction of $y_p$. This is
     computed as:

     $$\left( \sum_{i=1}^{q} c_i \right) \Big/ \left( s_{pp} - s_{pp\cdot 1,\ldots,q} \right)$$

VII. The new RR matrix (optional).

As indicated earlier, it is possible to specify cost of
observations, $c_i$, even though the standard option is used. In this case,
items IV and VI are still computed and printed, but of course, the
"best" variable to add (item Ia) is still the one actually added.

At each step at which a variable is being removed from regression,
item I above becomes "the variable just removed", and items II, III,
V, VI, and VII only are printed.

Minor changes to the program can be made to cause it to print
out other data after each step, such as estimated variances of the
estimated regression coefficients.

The next few pages show the actual program output of a
regression analysis performed by MV REGRESSION on a sample of size 300 of
a five variate normal. This sample was generated by the MV SIM
program using input vector U and V-C matrix Σ given by 4.1 and 4.2.






MULTIVARIATE ANALYSIS (CONTINUED) 2 19 1963 PAGE 7 

COMPUTER RUN DATA 
NUMBER OF SAMPLES = 3 



CRITERIA FOR CHOOSING WHICH VARIABLE TO ADD TO 
REGRESSION (AMONG THOSE PASSING F TEST) 

MAXIMUM REDUCTION OF THE CONDITIONAL VARIANCE OF Y 5 

HOWEVER, 

THE FOLLOWING COSTS OF OBSERVATION ARE SPECIFIED 

       Y1         Y2         Y3         Y4         Y5

    10.0000    12.0000    16.0000    20.0000      .0000



ANY ONE OF THE FOLLOWING CONDITIONS CAN HALT REGRESSION STEPS 

1) NUMBER OF VARIABLES IN REGRESSION REACHED 4 

2) CONDITIONAL VARIANCE OF Y 5 BECAME LESS THAN 4.0 

3) LAST VARIABLE ADDED REDUCED THE CONDITIONAL VARIANCE 

OF Y 5 BY LESS THAN 2.0 

4) LAST VARIABLE ADDED COSTS MORE THAN 10.00 DOLLARS 

TO OBSERVE PER UNIT OF VARIANCE REDUCTION OF Y 5 

5) NO MORE VARIABLES (AMONG THOSE NOT IN REGRESSION)

PASS THE F TEST OF SIGNIFICANCE

MULTIVARIATE ANALYSIS (CONTINUED) 2 19 1963                 PAGE 8

SAMPLE NUMBER 1
SAMPLE OF SIZE 300 OF THE 5 VARIATE NORMAL

SAMPLE MEANS

       Y1         Y2         Y3         Y4         Y5
     7.4125    48.2586    11.9077    29.8140    95.4458

SAMPLE VARIANCE COVARIANCE MATRIX

       Y1         Y2         Y3         Y4         Y5
    31.5587    14.6327   -28.5108   -17.8985    55.3288
    14.6327   227.5360    -3.9154  -243.4449   171.7598
   -28.5108    -3.9154    39.1032    -7.6404   -39.8334
   -17.8985  -243.4449    -7.6404   276.5990  -191.4721
    55.3288   171.7598   -39.8334  -191.4721   199.0421

MULTIVARIATE ANALYSIS (CONTINUED) 2 19 1963 PAGE 10 

ANALYSIS OF SAMPLE NUMBER 1 

SAMPLE VARIANCE OF Y 5 = 199.0421 

F LEVEL TO ENTER = 3.87 F LEVEL TO REMOVE = 3.7 



MATRIX TO 


START 








Y 1 

1.0000 


Y2 

.1726 - 


Y3 

.8116 - 


Y4 

.1915 


Y5 

.6981 


. 1726 


l.COOO - 


.0415 - 


.9703 


.8070 


.8116 - 


.0415 


1.0000 - 


.0734 - 


.4515 


.1915 - 


.9703 - 


.0734 


l.OCCO - 


.8160 


.6981 


.8070 - 


.4515 - 


.8160 


1 .0000 
MULTIVARIATE ANALYSIS (CONTINUED) 2 19 1963 PAGE 11

STEP 1

BEST VARIABLE TO ADD WAS Y 4
CHEAPEST VARIABLE TO ADD WAS Y 2
VARIABLE ADDED WAS Y 4

STATISTIC USED TO COMPARE WITH F(1, 298) = 595.9689

NEW MULTIPLE CORR COEFF SQUARED = .3340

NEW CONDITIONAL VARIANCE = 66.7210

COST TO OBSERVE Y 4 IN DOLLARS PER UNIT VARIANCE REDUCTION = .1511

NEW SET OF VARIABLES IN REGRESSION
     4

COEFFICIENTS B(I)          B0 = 116.0842
     -.6922    .0000    .0000    .0000    .0000

COST TO OBSERVE THIS SET OF VARIABLES PER UNIT
OF VARIANCE REDUCTION OF Y 5

     20.0000 DOLLARS DIVIDED BY
    132.3210 UNITS OF VARIANCE REDUCTION = .1511

THE NEW RR MATRIX

           Y1       Y2       Y3       Y4       Y5
Y1      .9632   -.0132   -.8256    .1915    .5417
Y2     -.0132    .0583   -.1128    .9703    .0152
Y3     -.8256   -.1128    .9946    .0734   -.5114
Y4     -.1915   -.9703   -.0734   1.0000   -.8160
Y5      .5417    .0152   -.5114    .8160    .3340
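
The RR matrix printed at each step can be reproduced from the starting correlation matrix by one pivot ("sweep") on the diagonal element of the variable just entered. The following is a minimal sketch under stated assumptions, not the MV REGRESSION code itself: the function name `sweep` is hypothetical, and the sign convention (pivot column negated) is inferred from the printed matrices rather than stated in the text.

```python
def sweep(a, k):
    """Return a copy of square matrix `a` pivoted ("swept") on diagonal element k."""
    n = len(a)
    d = a[k][k]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = 1.0 / d                      # pivot element
            elif i == k:
                out[i][j] = a[k][j] / d                  # pivot row
            elif j == k:
                out[i][j] = -a[i][k] / d                 # pivot column, sign flipped (assumed)
            else:
                out[i][j] = a[i][j] - a[i][k] * a[k][j] / d
    return out


# Starting correlation matrix for Y1..Y5, four-decimal values from the printout.
R = [
    [1.0000, 0.1726, -0.8116, -0.1915, 0.6981],
    [0.1726, 1.0000, -0.0415, -0.9703, 0.8070],
    [-0.8116, -0.0415, 1.0000, -0.0734, -0.4515],
    [-0.1915, -0.9703, -0.0734, 1.0000, -0.8160],
    [0.6981, 0.8070, -0.4515, -0.8160, 1.0000],
]

rr1 = sweep(R, 3)  # enter Y4 (index 3), as in step 1
```

With these inputs the lower right element `rr1[4][4]` is the unexplained fraction of the variance of Y5 after Y4 enters, which agrees with the printed .3340 to the precision of the rounded inputs.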
MULTIVARIATE ANALYSIS (CONTINUED) 2 19 1963 PAGE 12

STEP 2

BEST VARIABLE TO ADD WAS Y 1
CHEAPEST VARIABLE TO ADD WAS Y 1
VARIABLE ADDED WAS Y 1

STATISTIC USED TO COMPARE WITH F(1, 297) = 3089.6640

NEW MULTIPLE CORR COEFF SQUARED = .0293

NEW CONDITIONAL VARIANCE = 5.8889

COST TO OBSERVE Y 1 IN DOLLARS PER UNIT VARIANCE REDUCTION = .1643

NEW SET OF VARIABLES IN REGRESSION
     1    4

COEFFICIENTS B(I)          B0 = 102.8895
     1.4124   -.6008    .0000    .0000    .0000

COST TO OBSERVE THIS SET OF VARIABLES PER UNIT
OF VARIANCE REDUCTION OF Y 5

     30.0000 DOLLARS DIVIDED BY
    193.1531 UNITS OF VARIANCE REDUCTION = .1553

THE NEW RR MATRIX

           Y1       Y2       Y3       Y4       Y5
Y1     1.0380   -.0137   -.8571    .1988    .5624
Y2      .0137    .0581   -.1241    .9730    .0226
Y3      .8571   -.1241    .2868    .2376   -.0470
Y4      .1988   -.9730   -.2376   1.0380   -.7082
Y5     -.5624    .0226   -.0470    .7082    .0293
MULTIVARIATE ANALYSIS (CONTINUED) 2 19 1963 PAGE 13

STEP 3

BEST VARIABLE TO ADD WAS Y 2
CHEAPEST VARIABLE TO ADD WAS Y 2
VARIABLE ADDED WAS Y 2

STATISTIC USED TO COMPARE WITH F(1, 296) = 127.14672

NEW MULTIPLE CORR COEFF SQUARED = .0205

NEW CONDITIONAL VARIANCE = 4.1344

COST TO OBSERVE Y 2 IN DOLLARS PER UNIT VARIANCE REDUCTION = 6.8394

NEW SET OF VARIABLES IN REGRESSION
     1    2    4

COEFFICIENTS B(I)          B0 = 75.6177
     1.4258    .3643   -.2792    .0000    .0000

COST TO OBSERVE THIS SET OF VARIABLES PER UNIT
OF VARIANCE REDUCTION OF Y 5

     42.0000 DOLLARS DIVIDED BY
    194.9076 UNITS OF VARIANCE REDUCTION = .2154

THE NEW RR MATRIX

           Y1       Y2       Y3       Y4       Y5
Y1     1.0413    .2360   -.8864    .4285    .5677
Y2      .2360  17.1986  -2.1349  16.7347    .3895
Y3      .8864   2.1349    .0218   2.3150    .0012
Y4      .4285  16.7347  -2.3150  17.3214   -.3292
Y5     -.5677   -.3895    .0012    .3292    .0205

3) LAST VARIABLE ADDED REDUCED THE CONDITIONAL VARIANCE
OF Y 5 BY LESS THAN 2.0
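
The cost figures and the halting decision in the three steps above can be checked by direct arithmetic. The sketch below is only an illustration using the values printed in the output; the names `halting_conditions`, `steps`, and `report` are hypothetical, and condition 5 (the F test) is left out because it needs the F statistics of the variables not yet in regression.

```python
# Observation costs in dollars and printed variances, copied from the output above.
costs = {1: 10.0, 2: 12.0, 3: 16.0, 4: 20.0}
total_variance = 199.0421
steps = [(4, 66.7210), (1, 5.8889), (2, 4.1344)]  # (variable added, new conditional variance)

MAX_VARS, MIN_VARIANCE, MIN_REDUCTION, MAX_UNIT_COST = 4, 4.0, 2.0, 10.0


def halting_conditions(n_vars, cond_var, reduction, unit_cost):
    """Return the numbers of the halting conditions (1 to 4) that are met."""
    hits = []
    if n_vars >= MAX_VARS:
        hits.append(1)
    if cond_var < MIN_VARIANCE:
        hits.append(2)
    if reduction < MIN_REDUCTION:
        hits.append(3)
    if unit_cost > MAX_UNIT_COST:
        hits.append(4)
    return hits


report = []
previous = total_variance
in_regression = []
for var, cond_var in steps:
    in_regression.append(var)
    reduction = previous - cond_var                       # reduction due to this variable
    unit_cost = costs[var] / reduction                    # dollars per unit of that reduction
    set_cost = sum(costs[v] for v in in_regression) / (total_variance - cond_var)
    hits = halting_conditions(len(in_regression), cond_var, reduction, unit_cost)
    report.append((var, round(unit_cost, 4), round(set_cost, 4), hits))
    previous = cond_var
```

Only condition 3 is met, and only at the third step, where Y2 lowers the conditional variance by about 1.75 units; the recomputed unit costs agree with the printed ones to within rounding of the printed variances.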
Chapter IX

CURRENT STUDIES AND PROPOSALS FOR FUTURE RESEARCH 



In this chapter we discuss tests that have been started using 
programs MV SIM and MV REGRESSION. Also, plans for future research 
are proposed. 

Some tests (described in Appendix B) of a large number of 
samples generated by MV SIM have been completed. 

An empirical sampling study of the random variables involved in the
F test of chapter VI has been started, since the form of their
distribution is unknown and extremely difficult to obtain in closed
form. Actually, p-1 random variables, which we will call
$G_1, \ldots, G_{p-1}$, are under study at the same time. They are
defined by a specified p variate normal, the size of each sample of
the p variate normal, n, and the method of computing values of $G_i$,
$i = 1, \ldots, p-1$, from a sample, which is described next.

At step one, $G_1$ is defined as the maximum value of F (6.1), where
F is computed for each of the p-1 variables (none of which are in
regression yet). $G_2$ is dependent upon $G_1$ in the sense that $G_2$
is the value of max F (6.1) computed after the variable for which F
equals $G_1$ has been entered into regression. Thus, at step two, max F
is the maximum value of F for those p-2 variables still not in
regression. The step-wise procedure continues without the use of any
tests for halting, so that a new variable is added at each step. Thus,
at step i, $G_i$ equals max F, where F is computed for each variable
still not in regression by step i. After $G_i$ is recorded, the
variable for which F = $G_i$ is entered into regression.
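
The definition of $G_1, \ldots, G_{p-1}$ can be sketched in code. This is a minimal illustration under stated assumptions: formula (6.1) is not reproduced in this chapter, so the textbook partial F statistic with n - q - 2 denominator degrees of freedom is assumed here, and the names `sweep` and `max_f_sequence` are hypothetical. The input is a correlation matrix whose last row and column belong to the predicted variable.

```python
def sweep(a, k):
    """Return a copy of square matrix `a` pivoted on diagonal element k."""
    n = len(a)
    d = a[k][k]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = 1.0 / d
            elif i == k:
                out[i][j] = a[k][j] / d
            elif j == k:
                out[i][j] = -a[i][k] / d
            else:
                out[i][j] = a[i][j] - a[i][k] * a[k][j] / d
    return out


def max_f_sequence(r, n):
    """Forward selection with no halting test; returns (G values, entry order)."""
    p = len(r)
    y = p - 1                        # index of the predicted variable
    a = [row[:] for row in r]
    remaining = list(range(p - 1))
    g, order = [], []
    for q in range(p - 1):           # q variables are already in regression
        dof = n - q - 2              # assumed denominator degrees of freedom
        best_f, best_j = -1.0, None
        for j in remaining:
            v = a[j][y] * a[y][j] / a[j][j]      # variance reduction if j enters
            f = dof * v / (a[y][y] - v)
            if f > best_f:
                best_f, best_j = f, j
        g.append(best_f)             # G for this step is the largest F
        order.append(best_j)
        a = sweep(a, best_j)         # enter that variable
        remaining.remove(best_j)
    return g, order


# Example input: the sample correlation matrix printed in chapter VIII (n = 300).
R = [
    [1.0000, 0.1726, -0.8116, -0.1915, 0.6981],
    [0.1726, 1.0000, -0.0415, -0.9703, 0.8070],
    [-0.8116, -0.0415, 1.0000, -0.0734, -0.4515],
    [-0.1915, -0.9703, -0.0734, 1.0000, -0.8160],
    [0.6981, 0.8070, -0.4515, -0.8160, 1.0000],
]
g, order = max_f_sequence(R, 300)
```

For this input the variables enter in the order Y4, Y1, Y2, Y3, matching the first three steps of the printed analysis. Repeating the call over many samples of the same size would give the empirical distributions of the $G_i$ described in the text.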
Since the values of F (6.1) depend upon the sample size, we see
that each sample of size n of a specified p variate normal produces
one value of each of the random variables $G_1, \ldots, G_{p-1}$. Also,
to obtain repeated sets of values of the same random variables, the
sample size must be kept constant.

The tests that have been completed were performed on the five
variate normal specified by formulas 4.1 and 4.2. Six sample sizes,
50, 100, 150, 200, 250, and 300, have been computed 50 times each.
The results of $G_1$, $G_2$, $G_3$, $G_4$ for the sample size 100 are
plotted below in the form of estimated cumulative distribution
functions (c.d.f.'s). Where feasible, the graphs also show the curve
of the c.d.f. of $F_{(1,\,n-q-1)}$. (Recall that if the F test of
chapter VI had been applied, each value of $G_{q+1}$ would have been
compared with $F_{\alpha(1,\,n-q-1)}$ at step q + 1, for q = 0, 1, 2, 3.)
So far the min A parameter test has not been implemented in
program MV REGRESSION, so the type of artificial control of the
step-wise process described in chapter VII has not been tested.
However, a number of samples of size 300 of an 18 x 18 matrix (the
same matrix used in Appendix B) have been processed by MV REGRESSION,
using rather wide limits on the halt criteria. After examination of
the first run it was obvious that three variables in regression were
too many and that either one or two would be the right number.

Since the sample size was large, most samples allowed nine or more
variables to enter regression on the basis of passing the F test,
even though nearly all of the variables beyond two reduced the
estimated conditional variance of $y_{18}$ by less than 1.0 unit. By
comparison, the first variable usually reduced $s_{18,18}$ from about
18.6 to about 6.5. An examination of the computed statistics of all
variables (whether in regression or not) made it apparent that some
test such as the min A test might be quite useful here.

Advantage was taken of the fact that the true p variate normal
was known when samples obtained from it were being analyzed by
MV REGRESSION. For example, after the first run on several samples
of the 18 variate normal, only six of the 17 possible predictors
ever got into regression by the third step. Hence, all possible
pairs of these six variables were fed back to MV SIM, for which the
true conditional variances of $y_{18}$ were computed.

The various halt criteria suggested in chapter VII can be useful
in developing methods of searching for optimal combinations of
variables in regression. It is proposed that procedures, such as the
one described below, be tested and compared with procedures already
described to see if better results can be obtained.

We will assume that an experimenter has a large sample that
perhaps was very expensive to obtain. We shall permit the
experimenter two computer runs on the same sample, the first run
providing a set of feed-back data for the second run.

The main purpose of the first computer run is to determine a
lower bound on the conditional variance of $y_p$. This is accomplished
by using the F (and min F) test with the step-wise procedure, with
$\alpha$ set to permit most variables to enter regression. Of course,
at each step valuable information, such as the conditional variance
of $y_p$ and the amounts of variance reduction due to each variable,
should be printed.

From the first run the experimenter chooses the maximum number
of variables, say m, that he will have in his final prediction
equation. This is usually easy to do by examining the decreasing
values of $s_{pp.1}, s_{pp.1,2}, \ldots, s_{pp.1,\ldots,q}$, where q,
$q \ge m$, represents the number of variables in regression after the
first computer run.

The purpose of the second computer run is to make a rather
thorough (but not exhaustive) search for the optimal combination of
m variables in regression. The procedure is to conduct p-1 separate
regressions, each regression starting with a different first variable,
and continuing until m variables are in regression. At each step
(after the first), the variable chosen to enter regression will be
the variable that can contribute most reduction in the conditional
variance of $y_p$, unless, by adding this variable, a combination that
had been in regression previously (during a previous regression)
would result. For example, if the first regression added variables
in order $y_1$, $y_5$, $y_8$, then if the second regression proceeded
as $y_8$, $y_5$, variable $y_1$ would not be permitted to enter
regression next. Instead, the second best variable would be chosen at
this step.

Thus, after the second computer run is completed the experimenter
will have (p-1) x m prediction equations (and conditional variances
of $y_p$) to choose from, p-1 for each number of variables in
regression.
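
The proposed second run can be sketched as follows. This is a rough illustration of the idea rather than a tested procedure: the function names are hypothetical, the conditional variance is computed directly from a V-C matrix by solving the normal equations instead of by the step-wise shortcuts, and the fallback used when every candidate would repeat an earlier combination is an assumption the text does not address.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def conditional_variance(cov, subset, y):
    """Variance of variable y given the variables whose indices are in subset."""
    sxx = [[cov[i][j] for j in subset] for i in subset]
    sxy = [cov[i][y] for i in subset]
    b = solve(sxx, sxy)
    return cov[y][y] - sum(bi * si for bi, si in zip(b, sxy))


def restart_search(cov, m):
    """One regression per starting variable; skip combinations already seen."""
    y = len(cov) - 1
    predictors = list(range(y))
    seen, results = set(), []
    for start in predictors:
        chosen = [start]
        results.append((tuple(chosen), conditional_variance(cov, chosen, y)))
        while len(chosen) < m:
            ranked = sorted(
                (conditional_variance(cov, chosen + [j], y), j)
                for j in predictors if j not in chosen
            )
            fresh = [(v, j) for v, j in ranked if frozenset(chosen + [j]) not in seen]
            var, j = fresh[0] if fresh else ranked[0]   # fallback: best candidate (assumed)
            chosen.append(j)
            seen.add(frozenset(chosen))
            results.append((tuple(chosen), var))
    return results


# Example input: the sample V-C matrix printed in chapter VIII.
S = [
    [31.5587, 14.6327, -28.5108, -17.8985, 55.3288],
    [14.6327, 227.5360, -3.9154, -243.4449, 171.7598],
    [-28.5108, -3.9154, 39.1032, -7.6404, -39.8334],
    [-17.8985, -243.4449, -7.6404, 276.5990, -191.4721],
    [55.3288, 171.7598, -39.8334, -191.4721, 199.0421],
]
results = restart_search(S, 2)
```

The call returns (p-1) x m entries, each pairing an ordered set of predictors with its conditional variance, from which the experimenter could choose.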

Two further investigations are proposed. In Appendix B, the
results of tests of a number of samples of a five and an 18 variate
normal are described. As a result of the failure of the sample V-C
matrices, S, of the 18 variate normal to pass the chi-square test,
it is proposed that further testing of the multivariate normal
generator be conducted. As indicated in Appendix B, the possibility
of round off error should be considered.

It is also suggested that a study be made to ascertain which
of the two suggested tests of the matrix S is better. Possibly a
study would indicate weakness in both. Anderson [1], section 10.8,
describes a third test of matrix S.

The step-wise procedure of regression analysis as described in
this paper is called the "forward" method because it starts with no
variables in regression and adds them to regression one at a time.
The forward procedure was used because it permits computational
shortcuts so that the number of computations can be minimized
(especially so when Efroymson's computer program algorithm is
used [3]). The backward operation of removing extraneous variables,
however, offers no computational advantages. See Quenouille. Another
reason the forward procedure can be done with fewer computations is
that usually the number of variables in the final regression is much
less than p-1. Often the reason a large number of independent
variables is examined, compared to the number finally used, is that
from those variables actually measured additional variables are often
created to account for possible curvilinearity and interaction. For
example, if $X_1$ is a variable whose value was actually measured,
variables $Y = X_1^2$, $Z = X_1^3$ may be computed and used as part of the
original p-l possible predictors. j^J , see page 20. 

One possible advantage in using the backward method is to start
the process by computing an estimate of the lowest possible value of
the conditional variance of $y_p$, $s_{pp.1,\ldots,p-1}$. If somehow
this value could be obtained before the forward procedure was
performed, one could estimate the amount of reduction available in
the combination of variables still not in regression at each step.
Knowledge of this value at each step should be useful in deciding
which way would be best to go next, i.e., eliminate the weakest
variables now in regression, or add the strongest variable still not
in regression, or halt.
Appendix A



GENERATION OF THE P VARIATE NORMAL
BY PROGRAM MV SIM



For the construction of each sample (of size one) from the
specified p variate normal, MV SIM uses an independent sample of
size p from the normal (0,1) distribution (i.e., mean $\mu = 0$,
variance $\sigma^2 = 1$).

To obtain each independent normal random sample (of size one),
MV SIM computes a function of an independent sample of size 12 from
the uniform (0,1) distribution (i.e., uniform on the interval zero
to one). That this function only approximates normally distributed
random numbers will be shown below.

It follows from the above that to generate a sample of size n
of a p variate normal, n x p x 12 random numbers from the uniform
(0,1) random number generator are required.

A discussion of several techniques for generating uniformly
distributed "pseudo" random numbers is given by Barron [2].
Empirical test procedures are also given.

The particular uniform (0,1) pseudo random number generator
used by MV SIM is a subroutine called RAND. RAND was programmed
according to specifications given by Green, Bert F. Jr., Smith, J. E.,
and Klem, Laura. The number of initial random numbers, n in the
reference, used by RAND is seven. This article also discusses a
number of empirical tests that have been applied to this method.

The method by which MV SIM uses 12 independent uniform (0,1)
random numbers to compute each pseudo normal (0,1) random number
is discussed by Vaa [10]. Briefly, each normal random number is
computed as:

    $x = \sum_{j=1}^{12} w_j - 6$,

where the $w_j$ are the required independent sample of size 12 from
the uniform (0,1) distribution. The variance of the uniform (0,1)
distribution is one-twelfth, and variances of independent, uniformly
distributed random variables are additive under convolution. Hence
it is convenient to select 12 as the number of uniform random
variables whose sum will approximate a normal variable. Means of
(independent) uniform variables are also additive, so that it remains
to subtract the constant six from the sums of 12 independent uniform
(0,1) random variables to approximate the normal (0,1) distribution.
Vaa has a discussion of the advantages and disadvantages of this
"truncated" approximation to the normal distribution.

Wold [11], pages xi to xiii, describes the method which MV SIM
uses to convert an independent sample of size p from the normal
(0,1) distribution to a sample from a p variate normal specified by
U and $\Sigma$. This method requires the computation of a p x p
triangular matrix, P, from the original V-C matrix, $\Sigma$, so that
the following matrix equation holds:

    $P P^T = \Sigma$.

For our discussion we arbitrarily choose the triangulation of
$P = \{p_{ij}\}$ so that $p_{ij} = 0$ when j > i, i.e., let all "upper
diagonal" elements of P equal zero. Next, assuming $x_1, \ldots, x_p$
is an independent sample of size p from normal (0,1), the sample of
size one of the p variate normal is computed as:

    $y_1 = u_1 + p_{11} x_1$
    $y_2 = u_2 + p_{21} x_1 + p_{22} x_2$
    ...
    $y_p = u_p + p_{p1} x_1 + \cdots + p_{pp} x_p$

where the $u_i$ are the elements of mean vector U.
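
The generation scheme just described can be sketched as below. This is a minimal illustration, not the MV SIM program: Python's `random.Random` stands in for the RAND subroutine, the function names are hypothetical, and the lower triangular factor is computed by the Cholesky recurrence, which is an assumed way of obtaining a P with $P P^T = \Sigma$.

```python
import random


def approx_std_normal(rng):
    """Sum of 12 uniform (0,1) numbers minus six: mean 0, variance 1."""
    return sum(rng.random() for _ in range(12)) - 6.0


def lower_triangular_factor(sigma):
    """Lower triangular P with P P^T = sigma (Cholesky recurrence, assumed)."""
    p = len(sigma)
    P = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sigma[i][j] - sum(P[i][k] * P[j][k] for k in range(j))
            P[i][j] = s ** 0.5 if i == j else s / P[j][j]
    return P


def sample_mvn(u, P, rng):
    """One sample (of size one) from the p variate normal: y = u + P x."""
    p = len(u)
    x = [approx_std_normal(rng) for _ in range(p)]   # uses p x 12 uniform numbers
    return [u[i] + sum(P[i][k] * x[k] for k in range(i + 1)) for i in range(p)]


rng = random.Random(1963)                 # arbitrary seed for repeatability
sigma = [[4.0, 2.0], [2.0, 3.0]]          # small illustrative V-C matrix
P = lower_triangular_factor(sigma)
samples = [sample_mvn([1.0, -2.0], P, rng) for _ in range(2000)]
```

Each approximate normal value lies strictly between -6 and 6, which is the truncation referred to above; a sample of size n therefore consumes n x p x 12 uniform numbers.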

The term "pseudo" random number is customarily given to numbers
generated by arithmetic means (see Barron [2], pages 5 and 6); the
numbers produced by the RAND subroutine are of this kind.

It is now clear that the samples of size n of the p variate
normal generated by MV SIM are themselves pseudo random numbers,
since they are merely arithmetic functions of uniform pseudo random
numbers. Perhaps in this context, the operation of this part of
MV SIM might have been called "simulation" of a p variate normal,
rather than "generation". To carry this process one step further,
sample mean vector Z, and V-C matrix S, being arithmetic functions 
of a sample of size n, are likewise pseudo random matrices. As in 
the case of the pseudo uniform and normal random numbers, it is 
desirable that some empirical tests be applied to these pairs of 
pseudo random matrices. 

Appendix B describes some tests in detail: one for vector Z,
and one for matrix S. These tests are (optionally) performed by
MV SIM on each sample, but MV SIM takes no corrective action except
to print out the value of the computed statistics and an indication
of the proper distribution to be compared with the statistics.

The sequential operation of program MV SIM is as follows:

1. Print out input mean vector U, and V-C matrix $\Sigma$, and
   other miscellaneous data identifying the computer run.

2. Compute the P matrix from $\Sigma$ as described above.
   Optionally, the P matrix may be printed out.

3. List the variance of $y_p$, $\sigma_{pp}$.

4. Compute the prediction equation for $y_p$, for each combination
   of variables, $y_1, \ldots, y_{p-1}$, that are specified by the
   program user as input. For each such regression the
   following data are printed:

   a) regression number

   b) q+1 variate normal, where q is the number of variables
      in regression

   c) multiple correlation coefficient (squared)

   d) conditional variance of $y_p$, $\sigma_{pp.1,\ldots,q}$

   e) the regression coefficients, $\beta_i$ (optional)

5. Print out input data regarding samples of the specified
   distribution as described and illustrated in chapter VIII,
   i.e., numbers of samples, observation costs, whether
   "standard" or "cost" option is used, etc.

The following operations are performed on each sample specified:

6. Generate the required sample of the specified p variate
   normal.

7. Compute sample mean vector Z, and V-C matrix, S. Print
   out Z and S.

8. Test sample means, Z (optional) (see Appendix B). Print
   out eigenvectors and eigenvalues of matrix, S, from which
   the proper statistic is computed. Also print out the
   statistic and the proper degrees of freedom of F to be
   used for comparison.

9. Test sample matrix, S (optional) (see Appendix B).
   Print out eigenvalues of sample matrix, S. Print out
   the statistic to be compared with the chi-squared
   distribution. Also print out proper degrees of freedom to be
   used for comparison.

Of course, the user of program MV SIM can omit some of the above
items, such as items 3 and 4, at his discretion.

The actual analysis of each sample and associated printed output
performed by MV REGRESSION is described and illustrated in detail in
chapter VIII.
Appendix B



TESTS OF SAMPLE MEAN VECTOR, Z,
AND SAMPLE VARIANCE-COVARIANCE MATRIX, S



For a discussion of some of the problems encountered in generating
random numbers by arithmetic means, see Barron [2] and Vaa [9].
Graybill [4], page 206, shows that if Y is a p variate normal
with mean vector U and V-C matrix $\Sigma$, then the quantity:

    $v = n (Z - U)^T S^{-1} (Z - U) (n - p) / (p (n - 1))$

is distributed as $F_{(p,\,n-p)}$, if indeed Z and S are computed from
a sample of size n from the specified p variate normal. Hence, to test
a sample mean vector, Z, an appropriate level, $\alpha$ (usually .05),
is chosen. Then if v is less than $F_{\alpha(p,\,n-p)}$, vector Z is
accepted as having been computed from a reasonable sample; otherwise
Z is rejected.
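
The statistic for the mean vector can be computed as in the sketch below. It assumes the usual Hotelling form, in which the quadratic form is multiplied by the sample size n and S is the V-C matrix of individual observations computed with divisor n - 1; the function names are hypothetical, and the comparison with the tabled F value is left to the reader.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def mean_vector_statistic(z, u, s, n):
    """Return (v, (p, n - p)): the statistic and its F degrees of freedom."""
    p = len(z)
    d = [zi - ui for zi, ui in zip(z, u)]
    quad = sum(di * wi for di, wi in zip(d, solve(s, d)))   # (Z-U)^T S^-1 (Z-U)
    v = n * quad * (n - p) / (p * (n - 1))
    return v, (p, n - p)


# Tiny illustrative case: p = 2, n = 10, identity V-C matrix.
v, dof = mean_vector_statistic([1.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 10)
```

Solving the linear system avoids forming the inverse of S explicitly, which matters when S is close to singular, as for the 18 variate normal discussed later in this appendix.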

To perform a test for a sample V-C matrix, S, an orthogonal
transformation is performed on both $\Sigma$ and S, separately,
yielding diagonal matrices A and D respectively. A is a V-C matrix
of a p variate normal with independent variables (i.e., all
covariances are equal to zero). Now, if it is true that S is computed
from a sample drawn from a p variate normal with V-C matrix, $\Sigma$,
then D should be a sample drawn from a p variate normal with V-C
matrix, A. Hence, a test that D is a sample from A should verify that
S is a sample from $\Sigma$.

Since each element of D, $s'_{ii}$ (i = 1, ..., p), is a sample
variance, and since each element of A, $\sigma'_{ii}$, is the true
variance corresponding to element $s'_{ii}$, for all i, intuitively,
it appears that each of the statistics:

    $(s'_{ii} / \sigma'_{ii}) (n - 1)$,  i = 1, ..., p,

should have the chi-square distribution with n - 1 degrees of freedom
(n is still the sample size). From this, and the fact that $s'_{ii}$ is
statistically independent of $s'_{jj}$ for all i, j = 1, ..., p
(i $\ne$ j), it follows that the statistic

    (B1)    $(n - 1) \sum_{i=1}^{p} (s'_{ii} / \sigma'_{ii})$

has the chi-square distribution with p(n-1) degrees of freedom,
since the degrees of freedom of sums of independent chi-squares are
additive.

Hence, to test each sample V-C matrix, S, MV SIM "rotates" $\Sigma$
and S, and computes formula B1 above from A and D. Printed out
(optionally) are the p diagonal elements of A and D (the eigenvalues
of matrices $\Sigma$ and S respectively). Also printed are the result
of formula B1 and the number of degrees of freedom of the chi-square
distribution to be used for comparison.

Programs MV SIM and MV REGRESSION were used to generate and test
a number of samples from two different p variate normals. One of
these normals is specified by 4.1 and 4.2 (five variate normal). The
other distribution was an 18 variate normal that was very close to
being singular. (Several sets of rows were close to each other in
value.)

Six sample sizes, 50, 100, 150, 200, 250, and 300, were studied
of the five variate normal, with 20 samples tested of each size.
Four sample sizes, 50, 100, 150, and 200, were studied of the
18 variate normal, with 20 samples tested of each size.

For the five variate normal, both statistics (for Z and S) 
appeared to behave as samples from their respective F and chi-squared 
distributions for all sample sizes. 

However, curious results were obtained from the unusual 18 
variate normal tested. All Z tests passed as nicely as for the five 
variate normal. However, the values of chi-square were much too 
high, indicating poor sample V-C matrices, S, were being generated. 
For example, for the 20 samples of size 100 (of the 18 variate normal)
the statistic B1 should behave as chi-square with 1782 degrees of
freedom (1782 is also the mean of that distribution). The 20 computed
values of B1 ranged from 2213 to 2683.
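
How far the reported values lie from expectation can be seen with a short calculation. The sketch below computes formula B1 from two lists of eigenvalues and then standardizes a chi-square value using the known mean (the degrees of freedom) and variance (twice the degrees of freedom). The function names are hypothetical, and pairing the eigenvalues of S and $\Sigma$ in sorted order is an assumption, since the text does not say how they are matched.

```python
import math


def b1_statistic(sample_eigs, true_eigs, n):
    """Formula B1 and its degrees of freedom, p(n - 1)."""
    pairs = zip(sorted(sample_eigs), sorted(true_eigs))   # assumed pairing by rank
    b1 = (n - 1) * sum(d / lam for d, lam in pairs)
    return b1, len(true_eigs) * (n - 1)


def chi_square_z(value, dof):
    """Standardized distance of a chi-square value from its mean."""
    return (value - dof) / math.sqrt(2.0 * dof)


# If every sample eigenvalue equalled its true value, B1 would equal its mean.
b1, dof = b1_statistic([1.0] * 18, [1.0] * 18, 100)

# The extremes reported for the 18 variate normal, n = 100.
z_low, z_high = chi_square_z(2213, 1782), chi_square_z(2683, 1782)
```

Under this normal approximation the smallest reported value is roughly seven standard deviations above the mean and the largest roughly fifteen, which is why the results are described as much too high.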

One possible reason for these poor results is the use of a poor
random number generator. However, the satisfactory results obtained
from testing the five variate normal, as well as tests of the uniform
random number generator conducted previously, lead one to seek a
different source of error.

Possibly a more reasonable explanation is the likelihood of
computer round off error. The large number of computations required
to rotate an 18 x 18 matrix, plus the fact that the matrices were all
nearly singular, could very likely cause this type of error. If this
is the case, the generated sample V-C matrices themselves may be
"good" samples that are merely difficult to test.

Another interesting possibility is the method used to rotate
matrix S for the test. Recall that rotating a symmetric matrix,
$\Sigma$, to yield a diagonal matrix, A, can always be done by finding
an orthogonal matrix, $R_1$, so that the following is satisfied:

    (B2)    $R_1^T \Sigma R_1 = A$.

Also, since S is symmetric, $R_2$ can be found so that

    $R_2^T S R_2 = D$,

where D is diagonal. Since $\Sigma$ and S are not exactly equal, it
follows that orthogonal matrices $R_1$ and $R_2$ will not be equal.

Perhaps one might argue that a "better" test might be to find
$R_1$ from the rotation of $\Sigma$, B2 above, and then compute:

    $R_1^T S R_1 = D'$,

where $D'$ should be nearly diagonal if S is a reasonable sample from
$\Sigma$; then compare the diagonal elements of $D'$ and A as described
above for D and A.